SlideShare a Scribd company logo
The Apache Solr Smart Data Ecosystem
Trey Grainger
SVP of Engineering, Lucidworks
DFW Data Science
2017.01.09
Trey Grainger
SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
About Me
• Apache Solr Overview
Lucidworks Fusion Overview
• Search & Relevancy
- Keyword Search
- Text Analysis
- Multilingual Text Analysis
• Recommendations (Demo)
• Relevancy Spectrum
• Reflected Intelligence
- Relevancy Tuning
- Learning to Rank (Demo)
- Signals (Demo) …
Agenda
…
• Semantic Search
- Entity Extraction (Demo)
- Query Parsing (Demo)
- Semantic Knowledge Graph (Demo)
• Streaming Expressions
• Solr / Fusion SQL (Demo)
• Solr Graph
DFW Data Science
what do you do?
The Apache Solr Smart Data Ecosystem
Search-Driven
Everything
Customer
Service
Customer
Insights
Fraud Surveillance
Research
Portal
Online Retail
Digital
Content
Lucidworks enables Search-Driven Everything
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations &
Alerts
Analytics & InsightsExtreme Relevancy
CUSTOMER
SERVICE
RESEARCH
PORTAL
DIGITAL
CONTENT
CUSTOMER
INSIGHTS
FRAUD
SURVEILLANCE
ONLINE
RETAIL
• Access all your data in a
number of ways from one
place.
• Secure storage and
processing from Solr and
Spark.
• Acquire data from any source
with pre-built connectors and
adapters.
Machine learning and
advanced analytics turn all
of your apps into intelligent
data-driven applications.
Apache Solr
“Solr is the popular, blazing-fast,
open source enterprise search
platform built on Apache Lucene™.”
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
The standard
for enterprise
search.
of Fortune 500
uses Solr.
90%
Lucidworks Fusion
DFW Data Science
The Apache Solr Smart Data Ecosystem
All Your Data
• Over 50 connectors to
integrate all your data
• Robust parsing framework
to seamlessly ingest all your
document types
• Point and click Indexing
configuration and iterative
simulation of results for full
control over your ETL
process
• Your security model
enforced end-to-end from
ingest to search across your
different datasources
Experience
Management
• Relevancy tuning: Point-and-click
query pipeline configuration allow
fine-grained control of results.
• Machine-driven relevancy:
Signals aggregation learn and
automatically tune relevancy and
drive recommendations out of the
box .
• Powerful pipeline stages:
Customize fields, stages,
synonyms, boosts, facets,
machine learning models, your
own scripted behavior, and
dozens of other powerful search
stages.
• Turnkey search UI
(Lucidworks View): Build a
sophisticated end-to-end search
application in just hours.
Operational Simplicity
SECURITY BUILT-IN
Shards Shards
Apache Solr
Apache Zookeeper
ZK 1
Leader
Election
Load
Balancing
Shared Config
Management
Worker Worker
Apache Spark
Cluster
Manager
Core Services
• • •
NLP
Recommenders / Signals
Blob Storage
Pipelines
Scheduling
Alerting / Messaging
Connectors
RESTAPI
Admin UI
Lucidworks
View
LOGS FILE WEB DATABASE CLOUD
HDFS(Optional)
• 75% decrease in
development time
• Licensing costs cut
by 50%
With Fusion’s out-of-the-box capabilities, we skipped
months in our dev cycle so we could focus our team
where they would have the most impact.
We cut our licensing costs by 50% and improved
application usability. The Lucidworks professional
services team amplified our success even further. We’re
all Fusion from here on out!”
“
Lourduraju Pamishetty
Senior IT Application Architect
—
• Seamless integration of your
entire search & analytics
platform
• All capabilities exposed
through secured API's, so
you can use our UI or build
your own.
• End-to-end security policies
can be applied out of the
box to every aspect of your
search ecosystem.
• Distributed, fault-tolerant
scaling and supervision of
your entire search
application
Core Services
• • •
NLP
Recommenders / Signals
Blob Storage
Pipelines
Scheduling
Alerting / Messaging
Connectors
RESTAPI
Admin UI
Lucidworks
View
LOGS FILE WEB DATABASE CLOUD
• Seamless integration of your
entire search & analytics
platform
• All capabilities exposed
through secured API's, so
you can use our UI or build
your own.
• End-to-end security policies
can be applied out of the
box to every aspect of your
search ecosystem.
• Distributed, fault-tolerant
scaling and supervision of
your entire search
application
Fusion powers search for the brightest companies in the world.
Lucidworks Fusion
search & relevancy
Basic Keyword Search
The beginning of a typical search journey
Term Documents
a doc1 [2x]
brown doc3 [1x] , doc5 [1x]
cat doc4 [1x]
cow doc2 [1x] , doc5 [1x]
… ...
once doc1 [1x], doc5 [1x]
over doc2 [1x], doc3 [1x]
the doc2 [2x], doc3 [2x],
doc4[2x], doc5 [1x]
… …
Document Content Field
doc1 once upon a time, in a land far,
far away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped over
the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo”
once.
… …
What you SEND to Lucene/Solr:
How the content is INDEXED into
Lucene/Solr (conceptually):
The inverted index
DFW Data Science
/solr/select/?q=apache solr
Field Documents
… …
apache doc1, doc3, doc4,
doc5
…
hadoop doc2, doc4, doc6
… …
solr doc1, doc3, doc4,
doc7, doc8
… …
doc5
doc7 doc8
doc1 doc3
doc4
solr
apache
apache solr
Matching queries to documents
DFW Data Science
Text Analysis
Generating terms to index from raw text
Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Takes incoming text and “cleans it up”
before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream
containing Zero or more Tokens
③ Zero or more TokenFilters
Examines and optionally modifies each
Token in the Token Stream
*From Solr in Action, Chapter 6
DFW Data Science
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Takes incoming text and “cleans it up”
before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream
containing Zero or more Tokens
③ Zero or more TokenFilters
Examines and optionally modifies each
Token in the Token Stream
Text Analysis in Solr
*From Solr in Action, Chapter 6
DFW Data Science
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Takes incoming text and “cleans it up”
before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream
containing Zero or more Tokens
③ Zero or more TokenFilters
Examines and optionally modifies each
Token in the Token Stream
Text Analysis in Solr
*From Solr in Action, Chapter 6
DFW Data Science
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters
Takes incoming text and “cleans it up”
before it is tokenized
② One Tokenizer
Splits incoming text into a Token Stream
containing Zero or more Tokens
③ Zero or more TokenFilters
Examines and optionally modifies each
Token in the Token Stream
Text Analysis in Solr
*From Solr in Action, Chapter 6
DFW Data Science
Multi-lingual Text Analysis
Analyzing text across multiple languages
Example English Analysis Chains
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
words="lang/stopwords_en.txt”
ignoreCase="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="lang/en_protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="lang/en_synonyms.txt" I
ignoreCase="true"
expand="true"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
DFW Data Science
Per-language Analysis Chains
DFW Data Science
*Some of the 32 different languages configurations in Appendix B of Solr in Action
Per-language Analysis Chains
*Some of the 32 different languages configurations in Appendix B of Solr in Action
DFW Data Science
Which Stemmer do I choose?
*From Solr in Action, Chapter 14
DFW Data Science
Common English Stemmers
DFW Data Science
*From Solr in Action, Chapter 14
When Stemming goes awry
Fixing Stemming Mistakes:
• Unfortunately, every stemmer will have problem-cases that aren’t handled as you
would expect
• Thankfully, Stemmers can be overriden
• KeywordMarkerFilter: protects a list of terms you specify from being stemmed
• StemmerOverrideFilter: applies a list of custom term mappings you specify
Alternate strategy:
• Use Lemmatization (root-form analysis) instead of Stemming
• Commercial vendors help tremendously in this space
• The Hunspell stemmer enables dictionary-based support of varying quality in over
100 languages
DFW Data Science
Relevancy
Scoring the results, returning the best matches
Classic Lucene Relevancy Algorithm (now switched to BM25):
*Source: Solr in Action, chapter 3
Score(q, d) =
∑ ( tf(t in d) · idf(t)2 · t.getBoost() · norm(t, d) ) · coord(q, d) · queryNorm(q)
t in q
Where:
t = term; d = document; q = query; f = field
tf(t in d) = numTermOccurrencesInDocument ½
idf(t) = 1 + log (numDocs / (docFreq + 1))
coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / (sumOfSquaredWeights ½ )
sumOfSquaredWeights = q.getBoost()2 · ∑ (idf(t) · t.getBoost() )2
t in q
norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
DFW Data Science
• Term Frequency: “How well a term describes a document?”
– Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents
TF * IDF
*Source: Solr in Action, chapter 3
DFW Data Science
News Search : popularity and freshness drive relevance
Restaurant Search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: More popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold it’s own against good
domain-specific relevance factors!
That’s great, but what about domain-specific knowledge?
DFW Data Science
John lives in Boston but wants to move to New York or possibly another big city. He is
currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location
in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a
Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking between $40K and $60K
*Example from chapter 16 of Solr in Action
Consider what you know about users
DFW Data Science
http://localhost:8983/solr/jobs/select/?
fl=jobtitle,city,state,salary&
q=(
jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
)
AND (
(city:"Boston" AND state:"MA")^15
OR state:"MA")
AND _val_:"map(salary, 40000, 60000,10, 0)”
*Example from chapter 16 of Solr in Action
Query for Jane
Jane is a nurse educator in Boston seeking between $40K and $60K
DFW Data Science
{ ...
"response":{"numFound":22,"start":0,"docs":[
{"jobtitle":" Clinical Educator
(New England/ Boston)",
"city":"Boston",
"state":"MA",
"salary":41503},
…]}}
*Example documents available @ http://github.com/treygrainger/solr-in-action
Search Results for Jane
{"jobtitle":"Nurse Educator",
"city":"Braintree",
"state":"MA",
"salary":56183},
{"jobtitle":"Nurse Educator",
"city":"Brighton",
"state":"MA",
"salary":71359}
DFW Data Science
You just built a
recommendation engine!
Demo: Recommendations
Traditional
Keyword
Search
Recommendations
Semantic
Search
User Intent
Personalized
Search
Augmented
Search
Domain-aware
Matching
The Relevancy
Spectrum
DFW Data Science
Basic Keyword Search
(inverted index, tf-idf, bm25,
query formulation, etc.)
Taxonomies / Entity
Extraction
(entity recognition,
ontologies, synonyms, etc.)
Query Intent
(query classification, semantic
query parsing, concept
expansion, rules, clustering,
classification)
Relevancy Tuning
(signals, AB testing/genetic
algorithms, Learning to Rank,
Neural Networks)
Self-learning
Data-driven App Sophistication
DFW Data Science
what is “reflected intelligence”?
The Three C’s
Content:
Keywords and other features in your documents
Collaboration:
How other’s have chosen to interact with your system
Context:
Available information about your users and their intent
Reflected Intelligence
“Leveraging previous data and interactions to improve how
new data and interactions should be interpreted”
DFW Data Science
Feedback Loops
User
Searches
User
Sees
Results
User
takes an
action
Users’ actions
inform system
improvements
DFW Data Science
● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click
logs
● Learning to Rank - using relevancy judgements and machine learning to train
a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
Examples of Reflected Intelligence
DFW Data Science
Relevancy Tuning
Improving ranking algorithms through experiments and models
How to Measure Relevancy?
A B C
Retrieved
Documents
Related
Documents
Precision = B/A
Recall = B/C
Problem:
Assume Prec = 90% and Rec = 100% but assume the 10% irrelevant documents were ranked at
the top of the retrieved documents, is that OK?
DFW Data Science
Normalized Discounted Cumulative Gain
Rank Relevancy
3 0.95
1 0.70
2 0.60
4 0.45
Rank Relevancy
1 0.95
2 0.85
3 0.80
4 0.65
Ranking
Ideal
Given
• Position is
considered in
quantifying
relevancy.
• Labeled dataset
is required.
DFW Data Science
Learning to Rank
Learning to Rank (LTR)
● It applies machine learning techniques to discover the best combination
of features that provide best ranking.
● It requires labeled set of documents with relevancy scores for given set
of queries
● Features used for ranking are usually more computationally expensive
than the ones used for matching
● It typically re-ranks a subset of the matched documents (e.g. top 1000)
DFW Data Science
DFW Data Science
Common LTR Algorithms
• RankNet* (Neural Network, boosted trees)
• LambdaMart* (set of regression trees)
• SVM Rank** (SVM classifier)
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
DFW Data Science
LambdaMart Example
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
DFW Data Science
Demo: Solr Learning to Rank
Obtaining Relevancy Judgements
Typical Methodologies
1) Hire employees, contractors, or interns
-Pros:
Accuracy
-Cons:
Expensive
Not scalable (cost or man-power-wise)
Data Becomes Stale
2) Crowdsource
-Pros:
Less cost, more scalable
-Cons:
Less accurate
Data still becomes stale
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
DFW Data Science
Reflected Intelligence: Possible to infer relevancy judgements?
Rank Document ID
1 Doc1
2 Doc2
3 Doc3
4 Doc4
Query
Query
Doc1 Doc2 Doc3
0
1 1
Query
Doc1 Doc2 Doc3
1
0 0
Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016
DFW Data Science
Automated Relevancy Benchmarking
Default Algorithm
0.61
0.59
0.58
0.60
0.61 0.61
0.60
0.61
0.60
0.75
0.74
0.75
0.74
0.75
0.73
0.75
0.76
0.75
0.74
0.79 0.79
0.78
0.79
0.80
0.81
0.80
0.81
0.79 0.79
0.70
0.71 0.71
0.69
0.70 0.70
0.69
0.70
0.71
0.70
0.75
0.76
0.77
0.76 0.76
0.77
0.76
0.75
0.76 0.76
0.30
0.31
0.32
0.33
0.32
0.30
0.31 0.31 0.31
0.32
10/1/16 10/2/16 10/3/16 10/4/16 10/5/16 10/6/16 10/7/16 10/8/16 10/9/16 10/10/16
Default Algorithm Algorithm 1 Algorithm 2 Algorithm3 Algorithm 4 Algorithm 5
DFW Data Science
Demo: Fusion Signals
The Apache Solr Smart Data Ecosystem
• 200%+ increase in
click-through rates
• 91% lower TCO
• Fewer support tickets
• Increased customer
satisfaction
semantic search
DFW Data Science
Building a Taxonomy of Entities
Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
- Word2Vec / Glove / Dice Conceptual Search
• Buy a dictionary (often doesn’t work for
domain-specific search problems)
• Generate a model of domain-specific phrases by
mining query logs for commonly searched phrases within the domain*
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
DFW Data Science
DFW Data Science
DFW Data Science
entity extraction
DFW Data Science
Demo: Solr Text Tagger
semantic query parsing
DFW Data Science
Probabilistic Query Parser
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop”
"senior java", "developer hadoop”
senior, "java developer", hadoop
senior, java, "developer hadoop" Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization,
and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
DFW Data Science
Demo: Probabilistic Query Parser
Semantic Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously
built, cleaned, and refined based upon common inputs from
interactions with real users of the system. The SolrTextTagger
works well for this.*
2) Also invoke a probabilistic query parser to dynamically
identify unknown phrases using statistics from a corpus of data
(language model)
*K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation
through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
DFW Data Science
query augmentation
DFW Data Science
Knowledge
Graph
Semantic Data Encoded into Free Text Content
e en eng engi engineer engineers
engineer engineersNode Type: Term
software
engineer
software
engineers
electrical
engineering
engineer
engineering software
…
…
…
Node Type:
Character Sequence
Node Type:
Term Sequence
Node Type:
Document
id: 1
text: looking for a software
engineerwith degree in
computer science or
electrical engineering
id: 2
text: apply to be a software
engineer and work with
other great software
engineers
id: 3
text: start a great careerin
electrical engineering
…
…
DFW Data Science
id: 1
job_title: Software Engineer
desc: software engineer at a
great company
skills: .Net, C#, java
id: 2
job_title: Registered Nurse
desc: a registered nurse at
hospital doing hard work
skills: oncology, phlebotemy
id: 3
job_title: Java Developer
desc: a software engineer or a
java engineer doing work
skills: java, scala, hibernate
field term postings list
doc pos
desc
a
1 4
2 1
3 1, 5
at
1 3
2 4
company 1 6
doing
2 6
3 8
engineer
1 2
3 3, 7
great 1 5
hard 2 7
hospital 2 5
java 3 6
nurse 2 3
or 3 4
registered 2 2
software
1 1
3 2
work
2 10
3 9
job_title java developer 3 1
… … … …
field doc term
desc
1
a
at
company
engineer
great
software
2
a
at
doing
hard
hospital
nurse
registered
work
3
a
doing
engineer
java
or
software
work
job_title 1
Software
Engineer
… … …
Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
DFW Data Science
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Set-theory View
Graph View
How the Graph Traversal Works
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
Data Structure View
Java
Scala Hibernate
docs
1, 2, 6
docs
3, 4
Oncology
doc 5
DFW Data Science
Knowledge
Graph
Graph Model
Structure:
Single-level Traversal / Scoring:
Multi-level Traversal / Scoring:
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Multi-level Traversal
Data Structure View
Graph View
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
skill:
Java
skill: Java
skill: Scala
skill:
Hibernate
skill:
Oncology
doc 1
doc 2
doc 3
doc 4
doc 5
doc 6
job_title:
Software
Engineer
job_title:
Data
Scientist
job_title:
Java
Developer
……
Inverted Index
Lookup
Forward Index
Lookup
Forward Index
Lookup
Inverted Index
Lookup
Java
Java
Developer
Hibernate
Scala
Software
Engineer
Data
Scientist
has_related_job_title
has_related_job_title
DFW Data Science
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge
Graph
Scoring nodes in the Graph
Foreground vs. Background Analysis
Every term scored against it’s context. The more
commonly the term appears within it’s foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords”, "values":[
{ "value":"hive", "relatedness": 0.9765, "popularity":369 },
{ "value":"spark", "relatedness": 0.9634, "popularity":15653 },
{ "value":".net", "relatedness": 0.5417, "popularity":17683 },
{ "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
{ "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
{ "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
+
-
Foreground Query:
"Hadoop"
DFW Data Science
Source: Trey Grainger,
Khalifeh AlJadda, Mohammed
Korayem, Andries Smith.“The
Semantic Knowledge Graph: A
compact, auto-generated
model for real-time traversal
and ranking of any relationship
within a domain”. DSAA 2016.
Knowledge
Graph
Multi-level Graph Traversal with Scores
software engineer*
(materialized node)
Java
C#
.NET
.NET
Developer
Java
Developer
Hibernate
ScalaVB.NET
Software
Engineer
Data
Scientist
Skill
Nodes
has_related_skillStarting
Node
Skill
Nodes
has_related_skill Job Title
Nodes
has_related_job_title
0.90
0.88 0.93
0.93
0.34
0.74
0.91
0.89
0.74
0.89
0.780.72
0.48
0.93
0.76
0.83
0.80
0.64
0.61
0.780.55
DFW Data Science
Knowledge
Graph
Use Case: Document Summarization
Experiment: Pass in raw text
(extracting phrases as needed), and
rank their similarity to the documents
using the SKG.
Additionally, can traverse the graph
to “related” entities/keyword phrases
NOT found in the original document
Applications: Content-based and
multi-modal recommendations
(no cold-start problem), data cleansing
prior to clustering or other ML methods,
semantic search / similarity scoring
Demo: Semantic Knowledge Graph
Knowledge
Graph
DFW Data Science
Knowledge
Graph
DFW Data Science
DFW Data Science
streaming expressions
• Perform relational operations on
streams
• Stream sources: search, jdbc, facets,
features, gatherNodes, shortestPath,
train, features, model, random, stats,
topic
• Stream decorators: classify, commit,
complement, daemon, executor, fetch,
having, leftOuterJoin, hashJoin,
innerJoin, intersect, merge, null,
outerHashJoin, parallel, priority,
reduce, rollup, scoreNodes, select,
sort, top, unique, update
Streaming Expressions
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
DFW Data Science
• Relies on docValues (column-oriented data
structure) and /export handler
• Extreme read performance (8-10x faster than
queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes”
independently of the data collection)
• Data tier (Solr collection)
Streaming API: Nuts and Bolts
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
DFW Data Science
Streaming Expressions - Examples
Shortest-path Graph
Traversal
Parallel Batch
Procesing
Train a Logistic Regression
Model
Distributed Joins
Rapid Export of all
Search Results
Pull Results from External Database
Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
Classifying
Search Results
Solr SQL
• SQL is ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer,
Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve
iteratively
SQL is natural extension for Solr’s parallel computing engine
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
DFW Data Science
Give me the top 5 action movies with rating of 4 or better
Mental Warm-up
/select?q=*:*
&fq=genre_ss:action
&fq=rating_i:[4 TO *]
&facet=true
&facet.limit=5
&facet.mincount=1
&facet.field=title_s
SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action'
AND rating_i='[4 TO *]’
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5
{ ...
"facet_counts":{
"facet_fields":{
"title_s":[
"Star Wars (1977)",501,
"Return of the Jedi (1983)",379,
"Godfather, The (1972)",351,
"Raiders of the Lost Ark (1981)",348,
"Empire Strikes Back, The (1980)",293]},
...}}
{"result-set":{"docs":[
{"title_s":"Star Wars (1977)”,"cnt":501},
{"title_s":"Return of the Jedi (1983)","cnt":379},
{"title_s":"Godfather, The (1972)","cnt":351},
{"title_s":"Raiders of the Lost Ark (1981)","cnt":348},
{"title_s":"Empire Strikes Back, The (1980)","cnt":293},
{"EOF":true,"RESPONSE_TIME":42}]}}
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
DFW Data Science
SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
WHERE genre_ss='romance' AND age_i='[30 TO *]'
GROUP BY gender_s
ORDER BY num_ratings desc
SQL Examples
SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
FROM movielens
GROUP BY title_s, genre_s
HAVING num_ratings >= 100
ORDER BY avg_rating desc
LIMIT 5
SELECT DISTINCT(user_id_i) as user_id
FROM movielens
WHERE genre_ss='documentary'
ORDER BY user_id desc
Give me the avg rating for men
and women over 30 for
romance movies
Give me the top 5 rated movies
with at least 100 ratings
Give me the set of unique users
that have rated documentaries
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
parallel(workers,
hashJoin(
search(movielens, q=*:*,
fl="user_id_i,movie_id_i,rating_i",
sort="movie_id_i asc",
partitionKeys="movie_id_i"),
hashed=search(movielens_movies, q=*:*,
fl="movie_id_i,title_s,genre_s",
sort="movie_id_i asc",
partitionKeys="movie_id_i"),
on="movie_id_i"
),
workers="4",
sort="movie_id_i asc"
)
Streaming Expression Example: hashJoin
The small “right” side of the join
gets loaded into memory on
each worker node
Each shard queried by N
workers, so 4 workers x 4
shards means 16 queries
(usually all replicas per shard
are hit)
Workers collection isolates parallel
computation nodes from data nodes
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
• spark-solr project uses streaming API to pull data
from Solr into Spark jobs if docValues enabled,
see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g clicks, to
compute boosts and recommendations using
Spark
• Custom Scala script jobs to perform complex
analysis on data in Solr, e.g. sessionize request
logs
• Power rich data visualizations using Fusion’s SQL
Engine powered by SparkSQL + Solr streaming
aggregations
How we use Solr streaming API in Fusion
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
DFW Data Science
DFW Data Science
Comparing SQL Capabilities
Fusion Solr Hive Drill SparkSQL
Secret Sauce
SparkSQL Benefits
+ Solr Benefits
+ Enterprise Security
Push complex
query constructs
into engine (full
text, spatial,
relevancy, graph,
functions, etc)
Mature SQL solution
for Hadoop stack
Execute SQL over
NoSQL data sources
Spark core
(optimized shuffle,
in-memory, etc),
integration of other
APIs: ML,
Streaming, GraphX
SQL Features Maturing Evolving Mature Maturing Maturing
Scaling
Linear (shards and
replicas) backed by
inverted index;
Linear (shards and
replicas) backed
by inverted index
Limited by Hadoop
infrastructure (table
scans)
Good, but need to
benchmark
Memory intensive;
Scale out using
Spark cluster,
backed by RDDs
Integration w/
external systems
Analytics Catalog API,
JDCB Driver, ODBC
Bridge
JDBC stream
source
external tables /
plugin API
many drivers
available
DataSource API,
many systems
supported
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Demo: Fusion SQL Engine
Graph
Graph Use Cases
• Anomaly detection /
fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Relationship discovery / scoring
Examples
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find all tweets mentioning “Solr” by me or people
I follow
o Find all draft blog posts about “Parallel SQL”
written by a developer
o Find 3-star hotels in NYC my friends stayed in
last year
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Solr Graph Timeline
• Some data is much more naturally represented as a graph
structure
• Solr 6.0: Introduced the Graph Query Parser
• Solr 6.1: Introduced Graph Streaming expressions
…
• Solr 6.3: Current Version
• TBD: Semantic Knowledge Graph (patch available)
DFW Data Science
Graph Query Parser
• Query-time, cyclic aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion
of root and/or leaves
• Limitations: single node/shard only
Examples:
• http://localhost:8983/solr/graph/query?fl=id,score&
q={!graph from=in_edge to=out_edge}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge
traversalFilter='foo:[* TO 15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&
q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Graph Streaming Expressions
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware
curl -X POST -H "Content-Type: application/x-www-form-urlencoded"
-d ‘expr=…’"http://localhost:18984/solr/movielens/stream"
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
All movies that viewers of a specific movie watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true"
)
Movie 161: “The Air Up There”
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Collaborative Filtering
expr=top(n="5", sort="count(*) desc",
gatherNodes(movielens,
top(n="30", sort="count(*) desc",
gatherNodes(movielens,
search(movielens, q="user_id_i:305", fl="movie_id_i",
sort="movie_id_i asc", qt=“/export"),
walk="movie_id_i->movie_id_i", gather="user_id_i",
maxDocFreq="10000", count(*)
)
),
walk="node->user_id_i", gather="movie_id_i", count(*)
)
)
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Comparing Graph Choices
Solr Elastic Graph Neo4J
Spark
GraphX
Best Use Case
QParser: predef.
relationships as filters
Expressions: fast,
query-based, dist.
graph ops
Limited to sequential,
term relatedness
exploration only
Graph ops and
querying that fit on a
single node
Large-scale, iterative
graph ops
Common Graph
Algorithms (e.g.
Pregel, Traversal)
Partial No Yes Yes
Scaling
QParser: Co-located
Shards only
Expressions: Yes
Yes Master/Replica Yes
Commercial
License Required
No Yes GPLv3 No
Visualizations
GraphML support
(e.g. Gephi)
Kibana Neo4j browser 3rd party
DFW Data Science
Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
Additional References:
DFW Data Science
Contact Info
Trey Grainger
trey.grainger@lucidworks.com
@treygrainger
http://solrinaction.com
Meetup discount (39% off): 39grainger
Other presentations:
http://www.treygrainger.com
DFW Data Science

More Related Content

What's hot (20)

PDF
Thought Vectors and Knowledge Graphs in AI-powered Search
Trey Grainger
 
PPTX
Building Search & Recommendation Engines
Trey Grainger
 
PPT
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Trey Grainger
 
PPTX
How to Build a Semantic Search System
Trey Grainger
 
PDF
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 
PPTX
The Relevance of the Apache Solr Semantic Knowledge Graph
Trey Grainger
 
PDF
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
PPTX
Searching for Meaning
Trey Grainger
 
PPTX
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Lucidworks
 
PDF
Natural Language Search with Knowledge Graphs (Activate 2019)
Trey Grainger
 
PDF
The Next Generation of AI-powered Search
Trey Grainger
 
PPTX
South Big Data Hub: Text Data Analysis Panel
Trey Grainger
 
PPTX
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Trey Grainger
 
PDF
Semantic & Multilingual Strategies in Lucene/Solr
Trey Grainger
 
PDF
The Future of Search and AI
Trey Grainger
 
PDF
Balancing the Dimensions of User Intent
Trey Grainger
 
PDF
Measuring Relevance in the Negative Space
Trey Grainger
 
PDF
AI, Search, and the Disruption of Knowledge Management
Trey Grainger
 
PPTX
The Semantic Knowledge Graph
Trey Grainger
 
PDF
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Lucidworks
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Trey Grainger
 
Building Search & Recommendation Engines
Trey Grainger
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Trey Grainger
 
How to Build a Semantic Search System
Trey Grainger
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Lucidworks
 
The Relevance of the Apache Solr Semantic Knowledge Graph
Trey Grainger
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Trey Grainger
 
Searching for Meaning
Trey Grainger
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Lucidworks
 
Natural Language Search with Knowledge Graphs (Activate 2019)
Trey Grainger
 
The Next Generation of AI-powered Search
Trey Grainger
 
South Big Data Hub: Text Data Analysis Panel
Trey Grainger
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Trey Grainger
 
Semantic & Multilingual Strategies in Lucene/Solr
Trey Grainger
 
The Future of Search and AI
Trey Grainger
 
Balancing the Dimensions of User Intent
Trey Grainger
 
Measuring Relevance in the Negative Space
Trey Grainger
 
AI, Search, and the Disruption of Knowledge Management
Trey Grainger
 
The Semantic Knowledge Graph
Trey Grainger
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Lucidworks
 

Similar to The Apache Solr Smart Data Ecosystem (20)

PDF
Getting started faster with LucidWorks for Solr
Lucidworks (Archived)
 
PDF
Data Science with Solr and Spark
Lucidworks
 
PPTX
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
PDF
Suche mit Apache Lucene & Co.
inovex GmbH
 
PDF
Apache Solr crash course
Tommaso Teofili
 
PDF
Webinar: Rapid Solr Development with Fusion
Lucidworks
 
PDF
Webinar: Fusion 3.1 - What's New
Lucidworks
 
PDF
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Kai Chan
 
PDF
Get the most out of Solr search with PHP
Paul Borgermans
 
PDF
Data Engineering with Solr and Spark
Lucidworks
 
PDF
Introduction to Solr
Erik Hatcher
 
PDF
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
lucenerevolution
 
PDF
KEYNOTE: Lucene / Solr road map
lucenerevolution
 
PPTX
Introduction to Apache Lucene/Solr
Rahul Jain
 
PDF
Webinar: Building Conversational Search with Fusion
Lucidworks
 
PDF
Find it, possibly also near you!
Paul Borgermans
 
PDF
Webinar: Site Search in an Hour with Fusion
Lucidworks
 
PPTX
Apache solr
Dipen Rangwani
 
PDF
Apace Solr Web Development.pdf
Ayesha Siddika
 
PDF
Introduction to Solr
Erik Hatcher
 
Getting started faster with LucidWorks for Solr
Lucidworks (Archived)
 
Data Science with Solr and Spark
Lucidworks
 
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
Suche mit Apache Lucene & Co.
inovex GmbH
 
Apache Solr crash course
Tommaso Teofili
 
Webinar: Rapid Solr Development with Fusion
Lucidworks
 
Webinar: Fusion 3.1 - What's New
Lucidworks
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Kai Chan
 
Get the most out of Solr search with PHP
Paul Borgermans
 
Data Engineering with Solr and Spark
Lucidworks
 
Introduction to Solr
Erik Hatcher
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
lucenerevolution
 
KEYNOTE: Lucene / Solr road map
lucenerevolution
 
Introduction to Apache Lucene/Solr
Rahul Jain
 
Webinar: Building Conversational Search with Fusion
Lucidworks
 
Find it, possibly also near you!
Paul Borgermans
 
Webinar: Site Search in an Hour with Fusion
Lucidworks
 
Apache solr
Dipen Rangwani
 
Apace Solr Web Development.pdf
Ayesha Siddika
 
Introduction to Solr
Erik Hatcher
 
Ad

More from Trey Grainger (6)

PDF
Reflected Intelligence: Real world AI in Digital Transformation
Trey Grainger
 
PDF
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Trey Grainger
 
PPTX
The Apache Solr Semantic Knowledge Graph
Trey Grainger
 
PDF
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
PDF
Building a real time big data analytics platform with solr
Trey Grainger
 
PPTX
Building a real time, solr-powered recommendation engine
Trey Grainger
 
Reflected Intelligence: Real world AI in Digital Transformation
Trey Grainger
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Trey Grainger
 
The Apache Solr Semantic Knowledge Graph
Trey Grainger
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Building a real time big data analytics platform with solr
Trey Grainger
 
Building a real time, solr-powered recommendation engine
Trey Grainger
 
Ad

Recently uploaded (20)

PPTX
Designing Production-Ready AI Agents
Kunal Rai
 
PDF
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PPTX
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
PDF
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
Designing Production-Ready AI Agents
Kunal Rai
 
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
July Patch Tuesday
Ivanti
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 

The Apache Solr Smart Data Ecosystem

  • 1. The Apache Solr Smart Data Ecosystem Trey Grainger SVP of Engineering, Lucidworks DFW Data Science 2017.01.09
  • 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Information Retrieval & Web Search - Stanford University Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Frequent conference speaker • Founder of Celiaccess.com, the gluten-free search engine • Lucene/Solr contributor About Me
  • 3. • Apache Solr Overview Lucidworks Fusion Overview • Search & Relevancy - Keyword Search - Text Analysis - Multilingual Text Analysis • Recommendations (Demo) • Relevancy Spectrum • Reflected Intelligence - Relevancy Tuning - Learning to Rank (Demo) - Signals (Demo) … Agenda … • Semantic Search - Entity Extraction (Demo) - Query Parsing (Demo) - Semantic Knowledge Graph (Demo) • Streaming Expressions • Solr / Fusion SQL (Demo) • Solr Graph DFW Data Science
  • 7. Lucidworks enables Search-Driven Everything Data Acquisition Indexing & Streaming Smart Access API Recommendations & Alerts Analytics & InsightsExtreme Relevancy CUSTOMER SERVICE RESEARCH PORTAL DIGITAL CONTENT CUSTOMER INSIGHTS FRAUD SURVEILLANCE ONLINE RETAIL • Access all your data in a number of ways from one place. • Secure storage and processing from Solr and Spark. • Acquire data from any source with pre-built connectors and adapters. Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.
  • 9. “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  • 10. Key Solr Features: ● Multilingual Keyword search ● Relevancy Ranking of results ● Faceting & Analytics (nested / relational) ● Highlighting ● Spelling Correction ● Autocomplete/Type-ahead Prediction ● Sorting, Grouping, Deduplication ● Distributed, Fault-tolerant, Scalable ● Geospatial search ● Complex Function queries ● Recommendations (More Like This) ● Graph Queries and Traversals ● SQL Query Support ● Streaming Aggregations ● Batch and Streaming processing ● Highly Configurable / Plugins ● Learning to Rank ● Building machine-learning models ● … many more *source: Solr in Action, chapter 2
  • 11. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 16. • Over 50 connectors to integrate all your data • Robust parsing framework to seamlessly ingest all your document types • Point and click Indexing configuration and iterative simulation of results for full control over your ETL process • Your security model enforced end-to-end from ingest to search across your different datasources
  • 18. • Relevancy tuning: Point-and-click query pipeline configuration allow fine-grained control of results. • Machine-driven relevancy: Signals aggregation learn and automatically tune relevancy and drive recommendations out of the box . • Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages. • Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
  • 20. SECURITY BUILT-IN Shards Shards Apache Solr Apache Zookeeper ZK 1 Leader Election Load Balancing Shared Config Management Worker Worker Apache Spark Cluster Manager Core Services • • • NLP Recommenders / Signals Blob Storage Pipelines Scheduling Alerting / Messaging Connectors RESTAPI Admin UI Lucidworks View LOGS FILE WEB DATABASE CLOUD HDFS(Optional)
  • 21. • 75% decrease in development time • Licensing costs cut by 50% With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!” “ Lourduraju Pamishetty Senior IT Application Architect —
  • 22. • Seamless integration of your entire search & analytics platform • All capabilities exposed through secured API's, so you can use our UI or build your own. • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem. • Distributed, fault-tolerant scaling and supervision of your entire search application
  • 23. Core Services • • • NLP Recommenders / Signals Blob Storage Pipelines Scheduling Alerting / Messaging Connectors RESTAPI Admin UI Lucidworks View LOGS FILE WEB DATABASE CLOUD • Seamless integration of your entire search & analytics platform • All capabilities exposed through secured API's, so you can use our UI or build your own. • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem. • Distributed, fault-tolerant scaling and supervision of your entire search application
  • 24. Fusion powers search for the brightest companies in the world.
  • 27. Basic Keyword Search The beginning of a typical search journey
  • 28. Term Documents a doc1 [2x] brown doc3 [1x] , doc5 [1x] cat doc4 [1x] cow doc2 [1x] , doc5 [1x] … ... once doc1 [1x], doc5 [1x] over doc2 [1x], doc3 [1x] the doc2 [2x], doc3 [2x], doc4[2x], doc5 [1x] … … Document Content Field doc1 once upon a time, in a land far, far away doc2 the cow jumped over the moon. doc3 the quick brown fox jumped over the lazy dog. doc4 the cat in the hat doc5 The brown cow said “moo” once. … … What you SEND to Lucene/Solr: How the content is INDEXED into Lucene/Solr (conceptually): The inverted index DFW Data Science
  • 29. /solr/select/?q=apache solr Field Documents … … apache doc1, doc3, doc4, doc5 … hadoop doc2, doc4, doc6 … … solr doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 solr apache apache solr Matching queries to documents DFW Data Science
  • 30. Text Analysis Generating terms to index from raw text
  • 31. Text Analysis in Solr A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solr in Action, Chapter 6 DFW Data Science
  • 32. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
  • 33. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
  • 34. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
  • 35. Multi-lingual Text Analysis Analyzing text across multiple languages
  • 36. Example English Analysis Chains <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt” ignoreCase="true" /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <charFilter class="solr.HTMLStripCharFilterFactory"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" I ignoreCase="true" expand="true"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.ASCIIFoldingFilterFactory"/> <filter class="solr.KStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> DFW Data Science
  • 37. Per-language Analysis Chains DFW Data Science *Some of the 32 different languages configurations in Appendix B of Solr in Action
  • 38. Per-language Analysis Chains *Some of the 32 different languages configurations in Appendix B of Solr in Action DFW Data Science
  • 39. Which Stemmer do I choose? *From Solr in Action, Chapter 14 DFW Data Science
  • 40. Common English Stemmers DFW Data Science *From Solr in Action, Chapter 14
  • 41. When Stemming goes awry Fixing Stemming Mistakes: • Unfortunately, every stemmer will have problem-cases that aren’t handled as you would expect • Thankfully, Stemmers can be overriden • KeywordMarkerFilter: protects a list of terms you specify from being stemmed • StemmerOverrideFilter: applies a list of custom term mappings you specify Alternate strategy: • Use Lemmatization (root-form analysis) instead of Stemming • Commercial vendors help tremendously in this space • The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages DFW Data Science
  • 42. Relevancy Scoring the results, returning the best matches
  • 43. Classic Lucene Relevancy Algorithm (now switched to BM25): *Source: Solr in Action, chapter 3 Score(q, d) = ∑ ( tf(t in d) · idf(t)2 · t.getBoost() · norm(t, d) ) · coord(q, d) · queryNorm(q) t in q Where: t = term; d = document; q = query; f = field tf(t in d) = numTermOccurrencesInDocument ½ idf(t) = 1 + log (numDocs / (docFreq + 1)) coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery queryNorm(q) = 1 / (sumOfSquaredWeights ½ ) sumOfSquaredWeights = q.getBoost()2 · ∑ (idf(t) · t.getBoost() )2 t in q norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost() DFW Data Science
  • 44. • Term Frequency: “How well a term describes a document?” – Measure: how often a term occurs per document • Inverse Document Frequency: “How important is a term overall?” – Measure: how rare the term is across all documents TF * IDF *Source: Solr in Action, chapter 3 DFW Data Science
  • 45. News Search : popularity and freshness drive relevance Restaurant Search: geographical proximity and price range are critical Ecommerce: likelihood of a purchase is key Movie search: More popular titles are generally more relevant Job search: category of job, salary range, and geographical proximity matter TF * IDF of keywords can’t hold it’s own against good domain-specific relevance factors! That’s great, but what about domain-specific knowledge? DFW Data Science
  • 46. John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development. Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry. Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job. Jane is a nurse educator in Boston seeking between $40K and $60K *Example from chapter 16 of Solr in Action Consider what you know about users DFW Data Science
  • 47. http://localhost:8983/solr/jobs/select/? fl=jobtitle,city,state,salary& q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 ) AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA") AND _val_:"map(salary, 40000, 60000,10, 0)” *Example from chapter 16 of Solr in Action Query for Jane Jane is a nurse educator in Boston seeking between $40K and $60K DFW Data Science
  • 48. { ... "response":{"numFound":22,"start":0,"docs":[ {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503}, …]}} *Example documents available @ http://github.com/treygrainger/solr-in-action Search Results for Jane {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183}, {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359} DFW Data Science
  • 49. You just built a recommendation engine!
  • 52. Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.) Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.) Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification) Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks) Self-learning Data-driven App Sophistication DFW Data Science
  • 53. what is “reflected intelligence”?
  • 54. The Three C’s Content: Keywords and other features in your documents Collaboration: How other’s have chosen to interact with your system Context: Available information about your users and their intent Reflected Intelligence “Leveraging previous data and interactions to improve how new data and interactions should be interpreted” DFW Data Science
  • 55. Feedback Loops User Searches User Sees Results User takes an action Users’ actions inform system improvements DFW Data Science
  • 56. ● Recommendation Algorithms ● Building user profiles from past searches, clicks, and other actions ● Identifying correlations between keywords/phrases ● Building out automatically-generated ontologies from content and queries ● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs ● Learning to Rank - using relevancy judgements and machine learning to train a relevance model ● Discovering misspellings, synonyms, acronyms, and related keywords ● Disambiguation of keyword phrases with multiple meanings ● Learning what’s important in your content Examples of Reflected Intelligence DFW Data Science
  • 57. Relevancy Tuning Improving ranking algorithms through experiments and models
  • 58. How to Measure Relevancy? A B C Retrieved Documents Related Documents Precision = B/A Recall = B/C Problem: Assume Prec = 90% and Rec = 100% but assume the 10% irrelevant documents were ranked at the top of the retrieved documents, is that OK? DFW Data Science
  • 59. Normalized Discounted Cumulative Gain Rank Relevancy 3 0.95 1 0.70 2 0.60 4 0.45 Rank Relevancy 1 0.95 2 0.85 3 0.80 4 0.65 Ranking Ideal Given • Position is considered in quantifying relevancy. • Labeled dataset is required. DFW Data Science
  • 61. Learning to Rank (LTR) ● It applies machine learning techniques to discover the best combination of features that provide best ranking. ● It requires labeled set of documents with relevancy scores for given set of queries ● Features used for ranking are usually more computationally expensive than the ones used for matching ● It typically re-ranks a subset of the matched documents (e.g. top 1000) DFW Data Science
  • 63. Common LTR Algorithms • RankNet* (Neural Network, boosted trees) • LambdaMart* (set of regression trees) • SVM Rank** (SVM classifier) ** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf * http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf DFW Data Science
  • 64. LambdaMart Example Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
  • 66. Obtaining Relevancy Judgements Typical Methodologies 1) Hire employees, contractors, or interns -Pros: Accuracy -Cons: Expensive Not scalable (cost or man-power-wise) Data Becomes Stale 2) Crowdsource -Pros: Less cost, more scalable -Cons: Less accurate Data still becomes stale Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
  • 67. Reflected Intelligence: Possible to infer relevancy judgements? Rank Document ID 1 Doc1 2 Doc2 3 Doc3 4 Doc4 Query Query Doc1 Doc2 Doc3 0 1 1 Query Doc1 Doc2 Doc3 1 0 0 Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
  • 68. Automated Relevancy Benchmarking Default Algorithm 0.61 0.59 0.58 0.60 0.61 0.61 0.60 0.61 0.60 0.75 0.74 0.75 0.74 0.75 0.73 0.75 0.76 0.75 0.74 0.79 0.79 0.78 0.79 0.80 0.81 0.80 0.81 0.79 0.79 0.70 0.71 0.71 0.69 0.70 0.70 0.69 0.70 0.71 0.70 0.75 0.76 0.77 0.76 0.76 0.77 0.76 0.75 0.76 0.76 0.30 0.31 0.32 0.33 0.32 0.30 0.31 0.31 0.31 0.32 10/1/16 10/2/16 10/3/16 10/4/16 10/5/16 10/6/16 10/7/16 10/8/16 10/9/16 10/10/16 Default Algorithm Algorithm 1 Algorithm 2 Algorithm3 Algorithm 4 Algorithm 5 DFW Data Science
  • 71. • 200%+ increase in click-through rates • 91% lower TCO • Fewer support tickets • Increased customer satisfaction
  • 74. Building a Taxonomy of Entities Many ways to generate this: • Topic Modelling • Clustering of documents • Statistical Analysis of interesting phrases - Word2Vec / Glove / Dice Conceptual Search • Buy a dictionary (often doesn’t work for domain-specific search problems) • Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain* * K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. DFW Data Science
  • 79. Demo: Solr Text Tagger
  • 82. Probabilistic Query Parser Goal: given a query, predict which combinations of keywords should be combined together as phrases Example: senior java developer hadoop Possible Parsings: senior, java, developer, hadoop "senior java", developer, hadoop "senior java developer", hadoop "senior java developer hadoop” "senior java", "developer hadoop” senior, "java developer", hadoop senior, java, "developer hadoop" Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015. DFW Data Science
  • 84. Semantic Query Parsing Identification of phrases in queries using two steps: 1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.* 2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model) *K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. DFW Data Science
  • 87. Knowledge Graph Semantic Data Encoded into Free Text Content e en eng engi engineer engineers engineer engineersNode Type: Term software engineer software engineers electrical engineering engineer engineering software … … … Node Type: Character Sequence Node Type: Term Sequence Node Type: Document id: 1 text: looking for a software engineerwith degree in computer science or electrical engineering id: 2 text: apply to be a software engineer and work with other great software engineers id: 3 text: start a great careerin electrical engineering … … DFW Data Science
  • 88. id: 1 job_title: Software Engineer desc: software engineer at a great company skills: .Net, C#, java id: 2 job_title: Registered Nurse desc: a registered nurse at hospital doing hard work skills: oncology, phlebotemy id: 3 job_title: Java Developer desc: a software engineer or a java engineer doing work skills: java, scala, hibernate field term postings list doc pos desc a 1 4 2 1 3 1, 5 at 1 3 2 4 company 1 6 doing 2 6 3 8 engineer 1 2 3 3, 7 great 1 5 hard 2 7 hospital 2 5 java 3 6 nurse 2 3 or 3 4 registered 2 2 software 1 1 3 2 work 2 10 3 9 job_title java developer 3 1 … … … … field doc term desc 1 a at company engineer great software 2 a at doing hard hospital nurse registered work 3 a doing engineer java or software work job_title 1 Software Engineer … … … Terms-Docs Inverted IndexDocs-Terms Forward IndexDocuments Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph DFW Data Science
  • 89. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Set-theory View Graph View How the Graph Traversal Works skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology Data Structure View Java Scala Hibernate docs 1, 2, 6 docs 3, 4 Oncology doc 5 DFW Data Science
  • 90. Knowledge Graph Graph Model Structure: Single-level Traversal / Scoring: Multi-level Traversal / Scoring:
  • 91. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Multi-level Traversal Data Structure View Graph View doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 job_title: Software Engineer job_title: Data Scientist job_title: Java Developer …… Inverted Index Lookup Forward Index Lookup Forward Index Lookup Inverted Index Lookup Java Java Developer Hibernate Scala Software Engineer Data Scientist has_related_job_title has_related_job_title DFW Data Science
  • 92. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Scoring nodes in the Graph Foreground vs. Background Analysis Every term scored against it’s context. The more commonly the term appears within it’s foreground context versus its background context, the more relevant it is to the specified foreground context. countFG(x) - totalDocsFG * probBG(x) z = -------------------------------------------------------- sqrt(totalDocsFG * probBG(x) * (1 - probBG(x))) { "type":"keywords”, "values":[ { "value":"hive", "relatedness": 0.9765, "popularity":369 }, { "value":"spark", "relatedness": 0.9634, "popularity":15653 }, { "value":".net", "relatedness": 0.5417, "popularity":17683 }, { "value":"bogus_word", "relatedness": 0.0, "popularity":0 }, { "value":"teaching", "relatedness": -0.1510, "popularity":9923 }, { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] } + - Foreground Query: "Hadoop" DFW Data Science
  • 93. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Multi-level Graph Traversal with Scores software engineer* (materialized node) Java C# .NET .NET Developer Java Developer Hibernate ScalaVB.NET Software Engineer Data Scientist Skill Nodes has_related_skillStarting Node Skill Nodes has_related_skill Job Title Nodes has_related_job_title 0.90 0.88 0.93 0.93 0.34 0.74 0.91 0.89 0.74 0.89 0.780.72 0.48 0.93 0.76 0.83 0.80 0.64 0.61 0.780.55 DFW Data Science
  • 94. Knowledge Graph Use Case: Document Summarization Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG. Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring
  • 100. • Perform relational operations on streams • Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, features, model, random, stats, topic • Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update Streaming Expressions Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  • 101. • Relies on docValues (column-oriented data structure) and /export handler • Extreme read performance (8-10x faster than queries using cursorMark) • Facet or map/reduce style aggregation modes • Tiered architecture • SQL interface tier • Worker tier (scale a pool of worker “nodes” independently of the data collection) • Data tier (Solr collection) Streaming API: Nuts and Bolts Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  • 102. Streaming Expressions - Examples Shortest-path Graph Traversal Parallel Batch Procesing Train a Logistic Regression Model Distributed Joins Rapid Export of all Search Results Pull Results from External Database Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html Classifying Search Results
  • 104. • SQL is ubiquitous language for analytics • People: Less training and easier to understand • Tools! Solr as JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL) • Query planning / optimization can evolve iteratively SQL is natural extension for Solr’s parallel computing engine Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  • 105. Give me the top 5 action movies with rating of 4 or better Mental Warm-up /select?q=*:* &fq=genre_ss:action &fq=rating_i:[4 TO *] &facet=true &facet.limit=5 &facet.mincount=1 &facet.field=title_s SELECT title_s, COUNT(*) as cnt FROM movielens WHERE genre_ss='action' AND rating_i='[4 TO *]’ GROUP BY title_s ORDER BY cnt desc LIMIT 5 { ... "facet_counts":{ "facet_fields":{ "title_s":[ "Star Wars (1977)",501, "Return of the Jedi (1983)",379, "Godfather, The (1972)",351, "Raiders of the Lost Ark (1981)",348, "Empire Strikes Back, The (1980)",293]}, ...}} {"result-set":{"docs":[ {"title_s":"Star Wars (1977)”,"cnt":501}, {"title_s":"Return of the Jedi (1983)","cnt":379}, {"title_s":"Godfather, The (1972)","cnt":351}, {"title_s":"Raiders of the Lost Ark (1981)","cnt":348}, {"title_s":"Empire Strikes Back, The (1980)","cnt":293}, {"EOF":true,"RESPONSE_TIME":42}]}} Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  • 106. SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens WHERE genre_ss='romance' AND age_i='[30 TO *]' GROUP BY gender_s ORDER BY num_ratings desc SQL Examples SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens GROUP BY title_s, genre_s HAVING num_ratings >= 100 ORDER BY avg_rating desc LIMIT 5 SELECT DISTINCT(user_id_i) as user_id FROM movielens WHERE genre_ss='documentary' ORDER BY user_id desc Give me the avg rating for men and women over 30 for romance movies Give me the top 5 rated movies with at least 100 ratings Give me the set of unique users that have rated documentaries DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 107. parallel(workers, hashJoin( search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"), hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"), on="movie_id_i" ), workers="4", sort="movie_id_i asc" ) Streaming Expression Example: hashJoin The small “right” side of the join gets loaded into memory on each worker node Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit) Workers collection isolates parallel computation nodes from data nodes DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 108. • spark-solr project uses streaming API to pull data from Solr into Spark jobs if docValues enabled, see: https://github.com/lucidworks/spark-solr • Perform aggregations of “signals”, e.g clicks, to compute boosts and recommendations using Spark • Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs • Power rich data visualizations using Fusion’s SQL Engine powered by SparkSQL + Solr streaming aggregations How we use Solr streaming API in Fusion Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  • 110. Comparing SQL Capabilities Fusion Solr Hive Drill SparkSQL Secret Sauce SparkSQL Benefits + Solr Benefits + Enterprise Security Push complex query constructs into engine (full text, spatial, relevancy, graph, functions, etc) Mature SQL solution for Hadoop stack Execute SQL over NoSQL data sources Spark core (optimized shuffle, in-memory, etc), integration of other APIs: ML, Streaming, GraphX SQL Features Maturing Evolving Mature Maturing Maturing Scaling Linear (shards and replicas) backed by inverted index; Linear (shards and replicas) backed by inverted index Limited by Hadoop infrastructure (table scans) Good, but need to benchmark Memory intensive; Scale out using Spark cluster, backed by RDDs Integration w/ external systems Analytics Catalog API, JDCB Driver, ODBC Bridge JDBC stream source external tables / plugin API many drivers available DataSource API, many systems supported DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 111. Demo: Fusion SQL Engine
  • 112. Graph
  • 113. Graph Use Cases • Anomaly detection / fraud detection • Recommenders • Social network analysis • Graph Search • Access Control • Relationship discovery / scoring Examples o Find all draft blog posts about “Parallel SQL” written by a developer o Find all tweets mentioning “Solr” by me or people I follow o Find all draft blog posts about “Parallel SQL” written by a developer o Find 3-star hotels in NYC my friends stayed in last year DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 114. Solr Graph Timeline • Some data is much more naturally represented as a graph structure • Solr 6.0: Introduced the Graph Query Parser • Solr 6.1: Introduced Graph Streaming expressions … • Solr 6.3: Current Version • TBD: Semantic Knowledge Graph (patch available) DFW Data Science
  • 115. Graph Query Parser • Query-time, cyclic aware graph traversal is able to rank documents based on relationships • Provides controls for depth, filtering of results and inclusion of root and/or leaves • Limitations: single node/shard only Examples: • http://localhost:8983/solr/graph/query?fl=id,score& q={!graph from=in_edge to=out_edge}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10] DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 116. Graph Streaming Expressions • Part of Solr’s broader Streaming Expressions capability • Implements a powerful, breadth-first traversal • Works across shards AND collections • Supports aggregations • Cycle aware curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d ‘expr=…’"http://localhost:18984/solr/movielens/stream" DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 117. All movies that user 389 watched expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i") DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 118. All movies that viewers of a specific movie watched expr:gatherNodes(movielens, gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true" ) Movie 161: “The Air Up There” DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 119. Collaborative Filtering expr=top(n="5", sort="count(*) desc", gatherNodes(movielens, top(n="30", sort="count(*) desc", gatherNodes(movielens, search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt=“/export"), walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*) ) ), walk="node->user_id_i", gather="movie_id_i", count(*) ) ) DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 120. Comparing Graph Choices Solr Elastic Graph Neo4J Spark GraphX Best Use Case QParser: predef. relationships as filters Expressions: fast, query-based, dist. graph ops Limited to sequential, term relatedness exploration only Graph ops and querying that fit on a single node Large-scale, iterative graph ops Common Graph Algorithms (e.g. Pregel, Traversal) Partial No Yes Yes Scaling QParser: Co-located Shards only Expressions: Yes Yes Master/Replica Yes Commercial License Required No Yes GPLv3 No Visualizations GraphML support (e.g. Gephi) Kibana Neo4j browser 3rd party DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  • 122. Contact Info Trey Grainger [email protected] @treygrainger http://solrinaction.com Meetup discount (39% off): 39grainger Other presentations: http://www.treygrainger.com DFW Data Science