Webinar: Solr 6 Deep Dive - SQL and Graph

2016
OCTOBER 11-14 
BOSTON, MA
http://lucenerevolution.com

Solr 6 Deep Dive: SQL
and Graph
Grant Ingersoll
@gsingers
CTO, Lucidworks
Tim Potter
@thelabdude
Sr. Software Engineer, Lucidworks

• Motivations
• Streaming Expressions and Parallel SQL
• Graph Capabilities
• How does this compare to…?
• Future Directions
Agenda

Search-Driven
Everything
Customer
Service
Customer
Insights
Fraud Surveillance
Research
Portal
Online Retail
Digital
Content

• Big data systems have grown too complex trying to satisfy a variety of access patterns
• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)
• Low-latency ranked retrieval (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)
• Graph traversal (Graphx, Giraph, Neo4j, …)
• De-normalization can be inconvenient as related data sets can change at different velocities
(movies vs. movie ratings)
• Leverage progress made by the Solr community to support big data in Solr using horizontal
scalability (shards & replicas)
• Don’t forget about speed ~ Search engines in general and Solr in particular are extremely fast!
Why Solr needs Parallel Computation

Lucidworks Fusion Is Search-Driven Everything
•Drive next generation relevance
via Content, Collaboration and
Context
•Harness best in class Open
Source: Apache Solr + Spark
•Simplify application
development and reduce
ongoing maintenance
CATALOG
DYNAMIC NAVIGATION
AND LANDING PAGES
INSTANT INSIGHTS AND
ANALYTICS
PERSONALIZED
SHOPPING EXPERIENCE
PROMOTIONS USER HISTORY
Data Acquisition
Indexing & Streaming
Smart Access API
Recommendations & 
Alerts
Analytics & InsightsExtreme Relevancy
Access data from
anywhere to build
intelligent, data-
driven applications.

Fusion Architecture
RESTAPI
Worker Worker Cluster Mgr.
Apache Spark
Shards Shards
Apache Solr
HDFS(Optional)
Shared Conﬁg
Mgmt
Leader
Election
Load
Balancing
ZK 1
Apache Zookeeper
ZK N
DATABASEWEBFILELOGSHADOOP CLOUD
Connectors
Alerting/Messaging
NLP
Pipelines
Blob Storage
Scheduling
Recommenders/Signals
…
Core Services
Admin UI
SECURITY BUILT-IN
Lucidworks View

• SQL is ubiquitous language for analytics
• People: Less training and easier to understand
• Tools! Solr as JDBC data source (DbVisualizer,
Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve
iteratively
SQL is natural extension for Solr’s parallel computing engine

Give me the top 5 action movies with rating of 4 or better
Mental Warm-up
/select?q=*:*
&fq=genre_ss:action
&fq=rating_i:[4 TO *]
&facet=true
&facet.limit=5
&facet.mincount=1
&facet.field=title_s
SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action'
AND rating_i='[4 TO *]’
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5
{
...

"facet_counts":{

"facet_fields":{

"title_s":[

"Star
Wars
(1977)",501,

"Return
of
the
Jedi
(1983)",379,

"Godfather,
The
(1972)",351,

"Raiders
of
the
Lost
Ark
(1981)",348,

"Empire
Strikes
Back,
The
(1980)",293]},

...}}
{"result-‐set":{"docs":[

{"title_s":"Star
Wars
(1977)”,"cnt":501},

{"title_s":"Return
of
the
Jedi
(1983)","cnt":379},

{"title_s":"Godfather,
The
(1972)","cnt":351},

{"title_s":"Raiders
of
the
Lost
Ark
(1981)","cnt":348},

{"title_s":"Empire
Strikes
Back,
The
(1980)","cnt":293},

{"EOF":true,"RESPONSE_TIME":42}]}}

SELECT
gender_s,
COUNT(*)
as
num_ratings,
avg(rating_i)
as
avg_rating

FROM
movielens

WHERE
genre_ss='romance'
AND
age_i='[30
TO
*]'

GROUP
BY
gender_s

ORDER
BY
num_ratings
desc
SQL Examples

SELECT
title_s,
genre_s,
COUNT(*)
as
num_ratings,
avg(rating_i)
as
avg_rating

FROM
movielens

GROUP
BY
title_s,
genre_s

HAVING
num_ratings
>=
100

ORDER
BY
avg_rating
desc

LIMIT
5

SELECT
DISTINCT(user_id_i)
as
user_id

FROM
movielens

WHERE
genre_ss='documentary'

ORDER
BY
user_id
desc
Give me the avg rating for men
and women over 30 for
romance movies
Give me the top 5 rated movies
with at least 100 ratings
Give me the set of unique users
that have rated documentaries

• Perform relational operations on
streams
• Stream sources: search, jdbc, facets,
stats, topic, gatherNodes
• Stream decorators: complement,
daemon, leftOuterJoin, hashJoin,
innerJoin, intersect, merge,
outerHashJoin, parallel, reduce,
random, rollup, select, shortestPath,
sort, top, unique, update
Streaming Expressions

• Relies on docValues (column-oriented data
structure) and /export handler
• Extreme read performance (8-10x faster than
queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes”
independently of the data collection)
• Data tier (Solr collection)
Streaming API: Nuts and Bolts

parallel(workers,

hashJoin(

search(movielens,
q=*:*,

fl="user_id_i,movie_id_i,rating_i",

sort="movie_id_i
asc",

partitionKeys="movie_id_i"),

hashed=search(movielens_movies,
q=*:*,

fl="movie_id_i,title_s,genre_s",

sort="movie_id_i
asc",

partitionKeys="movie_id_i"),

on="movie_id_i"

),

workers="4",

sort="movie_id_i
asc"

)

Streaming Expression Example: hashJoin
The small “right” side of the join
gets loaded into memory on
each worker node
Each shard queried by N
workers, so 4 workers x 4 shards
means 16 queries (usually all
replicas per shard are hit)
Workers collection isolates parallel
computation nodes from data nodes

Aggregation Modes
• Map/Reduce aggregationMode — for high cardinality aggregations and distributed joins
(requires a shuffle phase to move keys to correct worker)
curl
-‐-‐data-‐urlencode
"stmt=SELECT
user_id_i,
avg(rating_i)
as
avg_rating
FROM
movielens
GROUP
BY
user_id_i"

“http://host:port/solr/movielens/sql?aggregationMode=map_reduce”

• Facet aggregationMode — Uses JSON facet engine for high performance on low-to-moderate
cardinality fields (e.g. movies)
curl
-‐-‐data-‐urlencode
"stmt=SELECT
movie_id_i,
avg(rating_i)
as
avg_rating
FROM
movielens
GROUP
BY
movie_id_i"

“http://host:port/solr/movielens/sql?aggregationMode=facet”

• spark-solr project uses streaming API to pull data
from Solr into Spark jobs if docValues enabled,
see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g clicks, to
compute boosts and recommendations using
Spark
• Custom Scala script jobs to perform complex
analysis on data in Solr, e.g. sessionize request
logs
• Power rich data visualizations using Spark SQL
over Solr streaming aggregations
How we use Solr streaming API in Fusion

• Anomaly detection and fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Examples:
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
Graph Use Cases

• Some data is much more naturally represented as a graph structure
• Traditionally hard to deal with in search’s inverted index
• Solr 6.0 introduces the Graph Query Parser
• Solr 6.1 brings Graph Streaming expressions
Graph Basics

• Query-time, cyclic aware graph traversal is able to rank
documents based on relationships
• Provides controls for depth, filtering of results and
inclusion of root and/or leaves
• Limitations: single node/shard only
• Examples:
• http://localhost:8983/solr/graph/query?
fl=id,score&q={!graph+from=in_edge+to=out_edge}
id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!
graph+from=in_edge+to=out_edge
+traversalFilter='foo:[*+TO+15]'}id:A
• http://localhost:8983/solr/my_graph/query?fl=id&q={!
graph+from=in_edge+to=out_edge+maxDepth=1}foo:
[*+TO+10]
Graph Query Parser

•Part of Solr’s broader Streaming Expressions capability
•Implements a powerful, breadth-ﬁrst traversal
•Works across shards AND collections
•Supports aggregations
•Cycle aware
Graph Streaming Expressions (Solr 6.1)
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d ‘expr=…’ "http://localhost:
18984/solr/movielens/stream"

All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")

All the Movies that viewers of Movie 161 watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")
Movie 161: “The Air Up There”

Collaborative Filtering Example
expr=top(n="5", sort="count(*) desc",
gatherNodes(movielens, top(n="30", sort="count(*) desc",
gatherNodes(movielens,
search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt=“/export"),
walk="movie_id_i->movie_id_i", gather="user_id_i",
maxDocFreq="10000", count(*))),
walk="node->user_id_i", gather="movie_id_i", count(*)))'

Comparing Graph Choices
Solr Elastic Graph Neo4J
Spark
GraphX
Best Use Case
QParser: predef.
relationships as ﬁlters
Expressions: fast,
query-based, dist.
graph ops
Term relationship
exploration
Graph ops and
querying that ﬁt on a
single node
Large-scale, iterative
graph ops
Common Graph
Algorithms (e.g.
Pregel, Traversal)
Partial No Yes Yes
Scaling
QParser: no
Expressions: yes
Yes Master/Replica Yes
Commercial
License Required
No Yes GPLv3 No
Visualizations
GraphML support
(Gephi)
Kibana Neo4j browser 3rd party

Comparing Big Data SQL Choices
Solr Hive Drill SparkSQL
Secret Sauce
Push complex query
constructs into engine
(full text, spatial,
functions, etc)
Mature SQL solution
for Hadoop stack
Execute SQL over
NoSQL data sources
Spark core (optimized
shufﬂe, in-memory,
etc), integration of
other APIs: ML,
Streaming, GraphX
SQL Features Evolving Mature Maturing Maturing
Scaling
Linear (shards and
replicas) backed by
inverted index
Limited by Hadoop
infrastructure (table
scans)
Good, but need to
benchmark
Memory intensive;
Scale out using Spark
cluster, backed by
RDDs
Integration w/ external
systems
JDBC stream source
external tables /
plugin API
many drivers
available
DataSource API,
many systems
supported

• Alternate graph traversal approaches, e.g. depth-ﬁrst
• Possible support for Gremlin (Graph Traversal Language from Tinker Pop)
• Additional graph algorithms (e.g. strongly conn. components, page rank)
Future Work

• No support for pushing >, >=, <, <= operators in
WHERE clause down into Solr as range queries;
use range syntax [4 TO *] for now
• Using Solr function queries in WHERE clause, e.g.
WHERE
location_p='{!geofilt
d=90

pt=37.773972,-‐122.431297
sfield=location_p}’

• SQL Joins (SOLR-8593)
• Port SQL layer to use Apache Calcite vs. Presto
SQL: Current Limitations and Future Plans

Webinar: Solr 6 Deep Dive - SQL and Graph

More Related Content

What's hot (19)

Similar to Webinar: Solr 6 Deep Dive - SQL and Graph (20)

More from Lucidworks (20)

Recently uploaded (20)

Webinar: Solr 6 Deep Dive - SQL and Graph