Webinar: Solr 6 Deep Dive - SQL and Graph
October 11-14, 2016
Boston, MA
http://lucenerevolution.com
Solr 6 Deep Dive: SQL and Graph
Grant Ingersoll
@gsingers
CTO, Lucidworks
Tim Potter
@thelabdude
Sr. Software Engineer, Lucidworks
• Motivations
• Streaming Expressions and Parallel SQL
• Graph Capabilities
• How does this compare to…?
• Future Directions
Agenda
Search-Driven Everything: Customer Service, Customer Insights, Fraud Surveillance, Research Portal, Online Retail, Digital Content
• Big data systems have grown too complex trying to satisfy a variety of access patterns
  • Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)
  • Low-latency ranked retrieval (Solr, Elastic, DataStax, …)
  • Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)
  • Graph traversal (GraphX, Giraph, Neo4j, …)
• De-normalization can be inconvenient because related data sets can change at different velocities (movies vs. movie ratings)
• Leverage progress made by the Solr community to support big data in Solr using horizontal scalability (shards & replicas)
• Don’t forget about speed: search engines in general, and Solr in particular, are extremely fast!
Why Solr needs Parallel Computation
Lucidworks Fusion Is Search-Driven Everything
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
[Diagram labels: Catalog; Dynamic Navigation and Landing Pages; Instant Insights and Analytics; Personalized Shopping Experience; Promotions; User History; Data Acquisition; Indexing & Streaming; Smart Access API; Recommendations & Alerts; Analytics & Insights; Extreme Relevancy]
Access data from anywhere to build intelligent, data-driven applications.
Fusion Architecture
[Architecture diagram. Recoverable labels: REST API; Apache Spark (Worker, Worker, Cluster Mgr.); Apache Solr (Shards, Shards); HDFS (Optional); Apache Zookeeper (ZK 1 … ZK N: Shared Config Mgmt, Leader Election, Load Balancing); Connectors (Database, Web, File, Logs, Hadoop, Cloud); Core Services (Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/Signals, …); Admin UI; Lucidworks View; Security Built-In]
Streaming Expressions
and SQL
• SQL is a ubiquitous language for analytics
• People: less training and easier to understand
• Tools! Solr as a JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL)
• Query planning / optimization can evolve iteratively
SQL is a natural extension for Solr’s parallel computing engine
Give me the top 5 action movies with rating of 4 or better
Mental Warm-up
/select?q=*:*
&fq=genre_ss:action
&fq=rating_i:[4 TO *]
&facet=true
&facet.limit=5
&facet.mincount=1
&facet.field=title_s
SELECT title_s, COUNT(*) as cnt
FROM movielens
WHERE genre_ss='action'
AND rating_i='[4 TO *]'
GROUP BY title_s
ORDER BY cnt desc
LIMIT 5
{
  ...
  "facet_counts":{
    "facet_fields":{
      "title_s":[
        "Star Wars (1977)",501,
        "Return of the Jedi (1983)",379,
        "Godfather, The (1972)",351,
        "Raiders of the Lost Ark (1981)",348,
        "Empire Strikes Back, The (1980)",293]},
    ...}}
{"result-set":{"docs":[
  {"title_s":"Star Wars (1977)","cnt":501},
  {"title_s":"Return of the Jedi (1983)","cnt":379},
  {"title_s":"Godfather, The (1972)","cnt":351},
  {"title_s":"Raiders of the Lost Ark (1981)","cnt":348},
  {"title_s":"Empire Strikes Back, The (1980)","cnt":293},
  {"EOF":true,"RESPONSE_TIME":42}]}}
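The two responses above carry the same five buckets in different shapes: the facet response is a flat list that alternates value and count, while the SQL response is one document per row. A minimal Python sketch of that mapping follows; the helper name `facet_pairs_to_docs` is hypothetical and not part of any Solr client, and the counts are copied from the slide.

```python
from typing import Any, Dict, List


def facet_pairs_to_docs(flat: List[Any], field: str, count_name: str = "cnt") -> List[Dict[str, Any]]:
    """Turn a flat [value, count, value, count, ...] facet list into one
    dict per bucket, mirroring the docs of the SQL result-set."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating value/count entries")
    values = flat[0::2]   # even positions hold the facet values
    counts = flat[1::2]   # odd positions hold the counts
    return [{field: value, count_name: count} for value, count in zip(values, counts)]


# Facet counts copied from the /select response on the slide.
title_facets = [
    "Star Wars (1977)", 501,
    "Return of the Jedi (1983)", 379,
    "Godfather, The (1972)", 351,
    "Raiders of the Lost Ark (1981)", 348,
    "Empire Strikes Back, The (1980)", 293,
]

docs = facet_pairs_to_docs(title_facets, "title_s")
for doc in docs:
    print(doc)
```

The SQL form saves the client from doing this reshaping itself, which is part of the argument for exposing SQL on top of the facet engine.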
 	
SQL Examples

Give me the avg rating for men and women over 30 for romance movies

  SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
    FROM movielens
   WHERE genre_ss='romance' AND age_i='[30 TO *]'
GROUP BY gender_s
ORDER BY num_ratings desc

Give me the top 5 rated movies with at least 100 ratings

  SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating
    FROM movielens
GROUP BY title_s, genre_s
  HAVING num_ratings >= 100
ORDER BY avg_rating desc
   LIMIT 5

Give me the set of unique users that have rated documentaries

  SELECT DISTINCT(user_id_i) as user_id
    FROM movielens
   WHERE genre_ss='documentary'
ORDER BY user_id desc
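To make the grouping semantics of the first query concrete, here is a small sketch that runs the same shape of statement against an in-memory SQLite table rather than Solr. Everything in it is an assumption for illustration: the sample rows are invented, the multi-valued `genre_ss` field is simplified to a single-valued `genre_s` column, and `age_i >= 30` stands in for Solr's `age_i='[30 TO *]'` range syntax.

```python
import sqlite3

# Hypothetical sample rows: (title_s, genre_s, gender_s, age_i, rating_i).
ROWS = [
    ("Movie A", "romance", "F", 34, 5),
    ("Movie A", "romance", "M", 41, 3),
    ("Movie B", "romance", "F", 52, 4),
    ("Movie B", "romance", "M", 25, 2),  # under 30, filtered out
    ("Movie C", "action", "M", 38, 5),   # wrong genre, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movielens "
    "(title_s TEXT, genre_s TEXT, gender_s TEXT, age_i INTEGER, rating_i INTEGER)"
)
conn.executemany("INSERT INTO movielens VALUES (?, ?, ?, ?, ?)", ROWS)

# Same shape as the first slide query, written in standard SQL.
QUERY = """
    SELECT gender_s, COUNT(*) AS num_ratings, AVG(rating_i) AS avg_rating
      FROM movielens
     WHERE genre_s = 'romance' AND age_i >= 30
  GROUP BY gender_s
  ORDER BY num_ratings DESC
"""

result = conn.execute(QUERY).fetchall()
for gender, num_ratings, avg_rating in result:
    print(gender, num_ratings, avg_rating)
```

With these rows, three ratings survive the filter, so the output is one row per gender ordered by the number of ratings.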
• Perform relational operations on streams
• Stream sources: search, jdbc, facets, stats, topic, gatherNodes
• Stream decorators: complement, daemon, leftOuterJoin, hashJoin, innerJoin, intersect, merge, outerHashJoin, parallel, reduce, random, rollup, select, shortestPath, sort, top, unique, update
Streaming Expressions
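The idea of decorators wrapping sorted tuple streams can be illustrated with plain Python generators. This is only a conceptual sketch, not Solr's implementation: the functions borrow the names `merge`, `unique` and `rollup` from the decorator list, the two "shard" lists are invented data, and the output field names are chosen for the example.

```python
import heapq
from itertools import groupby
from typing import Dict, Iterable, Iterator

Record = Dict[str, float]


def merge(left: Iterable[Record], right: Iterable[Record], on: str) -> Iterator[Record]:
    """Merge two streams that are each already sorted ascending on `on`."""
    return heapq.merge(left, right, key=lambda t: t[on])


def unique(stream: Iterable[Record], over: str) -> Iterator[Record]:
    """Emit the first tuple per distinct `over` value (input must be sorted)."""
    for _, group in groupby(stream, key=lambda t: t[over]):
        yield next(group)


def rollup(stream: Iterable[Record], over: str, field: str) -> Iterator[Record]:
    """Aggregate count and average of `field` per `over` value (input must be sorted)."""
    for key, group in groupby(stream, key=lambda t: t[over]):
        values = [t[field] for t in group]
        yield {over: key, "count(*)": len(values), f"avg({field})": sum(values) / len(values)}


# Hypothetical per-shard results, each sorted by movie_id_i.
shard_1 = [{"movie_id_i": 1, "rating_i": 4}, {"movie_id_i": 3, "rating_i": 5}]
shard_2 = [{"movie_id_i": 1, "rating_i": 2}, {"movie_id_i": 2, "rating_i": 3}]

rolled = list(rollup(merge(shard_1, shard_2, "movie_id_i"), "movie_id_i", "rating_i"))
distinct = list(unique(merge(shard_1, shard_2, "movie_id_i"), "movie_id_i"))
print(rolled)
print(distinct)
```

Because every stage consumes sorted input lazily, none of them needs to hold the full data set in memory, which is the property that makes this style of composition attractive for large exports.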
• Relies on docValues (column-oriented data structure) and /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
  • SQL interface tier
  • Worker tier (scale a pool of worker “nodes” independently of the data collection)
  • Data tier (Solr collection)
Streaming API: Nuts and Bolts
parallel(workers,
  hashJoin(
    search(movielens, q=*:*,
           fl="user_id_i,movie_id_i,rating_i",
           sort="movie_id_i asc",
           partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*,
                  fl="movie_id_i,title_s,genre_s",
                  sort="movie_id_i asc",
                  partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)
Streaming Expression Example: hashJoin
• The small “right” side of the join gets loaded into memory on each worker node
• Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit)
• Workers collection isolates parallel computation nodes from data nodes
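The mechanics described in these callouts (partition both sides on the join key, load the small side into a hash table per worker, stream the large side past it) can be sketched in a few lines of Python. This is an illustration of the general hash-join technique under invented sample data; it is not Solr's code, and the function names `partition` and `hash_join` are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def partition(rows: Iterable[Row], key: str, workers: int) -> List[List[Row]]:
    """Assign each row to a worker by hashing the partition key."""
    buckets: List[List[Row]] = [[] for _ in range(workers)]
    for row in rows:
        buckets[hash(row[key]) % workers].append(row)
    return buckets


def hash_join(left: Iterable[Row], hashed: Iterable[Row], on: str) -> Iterator[Row]:
    """Load the small `hashed` side into memory, then stream the left side past it."""
    table = defaultdict(list)
    for row in hashed:
        table[row[on]].append(row)
    for row in left:
        for match in table.get(row[on], []):
            yield {**match, **row}


# Hypothetical data: ratings are the large side, movies the small "right" side.
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 4},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 5},
    {"user_id_i": 3, "movie_id_i": 10, "rating_i": 3},
]
movies = [
    {"movie_id_i": 10, "title_s": "Alpha", "genre_s": "action"},
    {"movie_id_i": 11, "title_s": "Beta", "genre_s": "drama"},
]

WORKERS = 2
rating_parts = partition(ratings, "movie_id_i", WORKERS)
movie_parts = partition(movies, "movie_id_i", WORKERS)

# Each worker joins only its own partition; results are merged by the sort key.
joined = sorted(
    (row for w in range(WORKERS) for row in hash_join(rating_parts[w], movie_parts[w], "movie_id_i")),
    key=lambda r: r["movie_id_i"],
)
for row in joined:
    print(row)
```

Partitioning both sides on the same key is what guarantees that matching rows land on the same worker, so no worker needs the whole right side.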
Aggregation Modes
• Map/Reduce aggregationMode: for high cardinality aggregations and distributed joins (requires a shuffle phase to move keys to correct worker)

curl --data-urlencode "stmt=SELECT user_id_i, avg(rating_i) as avg_rating FROM movielens GROUP BY user_id_i" \
  "http://host:port/solr/movielens/sql?aggregationMode=map_reduce"

• Facet aggregationMode: uses JSON facet engine for high performance on low-to-moderate cardinality fields (e.g. movies)

curl --data-urlencode "stmt=SELECT movie_id_i, avg(rating_i) as avg_rating FROM movielens GROUP BY movie_id_i" \
  "http://host:port/solr/movielens/sql?aggregationMode=facet"
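The same two calls can be prepared from Python with only the standard library. The sketch below builds the request without sending it; the helper name `build_sql_request` is hypothetical, and `http://localhost:8983/solr` is an assumed base URL standing in for the `host:port` placeholder on the slide.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

VALID_MODES = ("map_reduce", "facet")


def build_sql_request(base_url: str, collection: str, stmt: str, aggregation_mode: str) -> Request:
    """Build a POST to the collection's /sql handler, like the curl calls above."""
    if aggregation_mode not in VALID_MODES:
        raise ValueError(f"aggregation_mode must be one of {VALID_MODES}")
    query = urlencode({"aggregationMode": aggregation_mode})
    url = f"{base_url}/{collection}/sql?{query}"
    body = urlencode({"stmt": stmt}).encode("utf-8")  # same role as --data-urlencode
    return Request(url, data=body, method="POST")


stmt = "SELECT movie_id_i, avg(rating_i) as avg_rating FROM movielens GROUP BY movie_id_i"
# Assumed base URL; replace with the real host and port.
req = build_sql_request("http://localhost:8983/solr", "movielens", stmt, "facet")

print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("utf-8"))["stmt"][0])
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a Solr instance.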
• The spark-solr project uses the streaming API to pull data from Solr into Spark jobs if docValues are enabled; see: https://github.com/lucidworks/spark-solr
• Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark
• Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs
• Power rich data visualizations using Spark SQL over Solr streaming aggregations
How we use Solr streaming API in Fusion
Graph
• Anomaly detection and fraud detection
• Recommenders
• Social network analysis
• Graph Search
• Access Control
• Examples:
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “Parallel SQL” written by a developer
• Find 3-star hotels in NYC my friends stayed in last year
Graph Use Cases
• Some data is much more naturally represented as a graph structure
• Traditionally hard to deal with in search’s inverted index
• Solr 6.0 introduces the Graph Query Parser
• Solr 6.1 brings Graph Streaming expressions
Graph Basics
• Query-time, cycle-aware graph traversal is able to rank documents based on relationships
• Provides controls for depth, filtering of results and inclusion of root and/or leaves
• Limitations: single node/shard only
• Examples:
  • http://localhost:8983/solr/graph/query?fl=id,score&q={!graph+from=in_edge+to=out_edge}id:A
  • http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+traversalFilter='foo:[*+TO+15]'}id:A
  • http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+maxDepth=1}foo:[*+TO+10]
Graph Query Parser
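A rough mental model for `{!graph from=in_edge to=out_edge}` is a breadth-first search that stops revisiting documents it has already seen. The sketch below is an in-memory approximation under stated assumptions: the documents are invented, and the edge-matching rule (collect `to` field values from the current frontier, then match documents whose `from` field contains any of them) is my reading of the parser's behavior rather than something stated on the slide.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical documents with multi-valued in_edge / out_edge fields.
# D's out_edge "5" points back at A, which creates a cycle.
DOCS: Dict[str, Dict[str, List[str]]] = {
    "A": {"in_edge": ["5"], "out_edge": ["1", "2"]},
    "B": {"in_edge": ["1"], "out_edge": ["3"]},
    "C": {"in_edge": ["2"], "out_edge": ["4"]},
    "D": {"in_edge": ["3"], "out_edge": ["5"]},
    "E": {"in_edge": ["9"], "out_edge": []},
}


def graph_query(
    docs: Dict[str, Dict[str, List[str]]],
    roots: Iterable[str],
    from_field: str,
    to_field: str,
    max_depth: Optional[int] = None,
) -> Set[str]:
    """Breadth-first traversal that never revisits a document (cycle aware)."""
    visited = set(roots)
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        # Edge ids leaving the current frontier.
        edge_ids = {edge for doc_id in frontier for edge in docs[doc_id][to_field]}
        # Documents reachable over those edges, minus anything already seen.
        reached = {doc_id for doc_id, doc in docs.items() if edge_ids.intersection(doc[from_field])}
        frontier = reached - visited
        visited |= frontier
        depth += 1
    return visited


full = graph_query(DOCS, {"A"}, from_field="in_edge", to_field="out_edge")
one_hop = graph_query(DOCS, {"A"}, from_field="in_edge", to_field="out_edge", max_depth=1)
print(sorted(full))
print(sorted(one_hop))
```

The traversal terminates even though D links back to A, and E is never returned because no edge leads to it; a `traversalFilter` would simply be an extra predicate applied to `reached` at each hop.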
• Part of Solr’s broader Streaming Expressions capability
• Implements a powerful, breadth-first traversal
• Works across shards AND collections
• Supports aggregations
• Cycle aware
Graph Streaming Expressions (Solr 6.1)
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream"
All movies that user 389 watched
expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i")
All the Movies that viewers of Movie 161 watched
expr:gatherNodes(movielens,
gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"),
walk="node->user_id_i",gather="movie_id_i", trackTraversal="true")
Movie 161: “The Air Up There”
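The two `gatherNodes` expressions above can be read as "find documents whose walk field matches the root values, then collect the gather field". A minimal in-memory sketch of that reading follows; the ratings are invented, `gather_nodes` is a hypothetical helper, and the sketch ignores the cycle tracking and traversal metadata that the real expression provides.

```python
from typing import Dict, Iterable, List, Set

Rating = Dict[str, int]

# Hypothetical rating documents, one per (user, movie) pair.
RATINGS: List[Rating] = [
    {"user_id_i": 389, "movie_id_i": 161},
    {"user_id_i": 389, "movie_id_i": 200},
    {"user_id_i": 400, "movie_id_i": 161},
    {"user_id_i": 400, "movie_id_i": 300},
    {"user_id_i": 500, "movie_id_i": 999},
]


def gather_nodes(docs: Iterable[Rating], roots: Iterable[int], walk_field: str, gather_field: str) -> Set[int]:
    """Match docs whose walk_field is one of the roots; collect distinct gather_field values."""
    root_set = set(roots)
    return {doc[gather_field] for doc in docs if doc[walk_field] in root_set}


# walk="389->user_id_i", gather="movie_id_i"
movies_of_389 = gather_nodes(RATINGS, [389], "user_id_i", "movie_id_i")

# Inner step: walk="161->movie_id_i", gather="user_id_i"
viewers_of_161 = gather_nodes(RATINGS, [161], "movie_id_i", "user_id_i")
# Outer step: walk="node->user_id_i", gather="movie_id_i"
movies_of_viewers = gather_nodes(RATINGS, viewers_of_161, "user_id_i", "movie_id_i")

print(sorted(movies_of_389))
print(sorted(viewers_of_161))
print(sorted(movies_of_viewers))
```

In this simplified version the starting movie 161 shows up in its own result set, since nothing excludes nodes that were already visited.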
Collaborative Filtering Example
expr=top(n="5", sort="count(*) desc",
  gatherNodes(movielens,
    top(n="30", sort="count(*) desc",
      gatherNodes(movielens,
        search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt="/export"),
        walk="movie_id_i->movie_id_i", gather="user_id_i",
        maxDocFreq="10000", count(*))),
    walk="node->user_id_i", gather="movie_id_i", count(*)))
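Read inside out, the expression finds the movies user 305 rated, counts which users rated those same movies, keeps the 30 most overlapping users, counts the movies those users rated, and keeps the top 5. The sketch below mirrors those steps on invented data; `gather_with_counts` and `recommend` are hypothetical names, and treating `maxDocFreq` as "skip root values that occur in more than N documents" is an assumption about its meaning.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Rating = Dict[str, int]


def gather_with_counts(
    docs: List[Rating],
    roots: Iterable[int],
    walk_field: str,
    gather_field: str,
    max_doc_freq: Optional[int] = None,
) -> Counter:
    """Like gatherNodes with count(*): count gathered values over docs matching the roots."""
    root_set = set(roots)
    if max_doc_freq is not None:
        # Assumed meaning of maxDocFreq: drop roots that appear in too many docs.
        freq = Counter(doc[walk_field] for doc in docs)
        root_set = {root for root in root_set if freq[root] <= max_doc_freq}
    return Counter(doc[gather_field] for doc in docs if doc[walk_field] in root_set)


def recommend(docs: List[Rating], user_id: int, top_users: int = 30, top_movies: int = 5) -> List[Tuple[int, int]]:
    """Two-hop co-occurrence counting, following the shape of the expression above."""
    seen = {doc["movie_id_i"] for doc in docs if doc["user_id_i"] == user_id}
    user_counts = gather_with_counts(docs, seen, "movie_id_i", "user_id_i", max_doc_freq=10000)
    peers = [user for user, _ in user_counts.most_common(top_users)]
    movie_counts = gather_with_counts(docs, peers, "user_id_i", "movie_id_i")
    return movie_counts.most_common(top_movies)


# Hypothetical ratings: (user, movie) pairs.
PAIRS = [(305, 1), (305, 2), (306, 1), (306, 2), (306, 3), (307, 1), (307, 3), (308, 9)]
RATINGS = [{"user_id_i": user, "movie_id_i": movie} for user, movie in PAIRS]

recs = recommend(RATINGS, 305)
print(recs)
```

As in the slide's expression, nothing here removes movies the user has already rated, so a real recommender would filter `recs` against the user's own history before showing results.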
Comparisons
Comparing Graph Choices
Feature | Solr | Elastic Graph | Neo4J | Spark GraphX
Best Use Case | QParser: predef. relationships as filters; Expressions: fast, query-based, dist. graph ops | Term relationship exploration | Graph ops and querying that fit on a single node | Large-scale, iterative graph ops
Common Graph Algorithms (e.g. Pregel, Traversal) | Partial | No | Yes | Yes
Scaling | QParser: no; Expressions: yes | Yes | Master/Replica | Yes
Commercial License Required | No | Yes | GPLv3 | No
Visualizations | GraphML support (Gephi) | Kibana | Neo4j browser | 3rd party
Comparing Big Data SQL Choices
Feature | Solr | Hive | Drill | SparkSQL
Secret Sauce | Push complex query constructs into engine (full text, spatial, functions, etc) | Mature SQL solution for Hadoop stack | Execute SQL over NoSQL data sources | Spark core (optimized shuffle, in-memory, etc), integration of other APIs: ML, Streaming, GraphX
SQL Features | Evolving | Mature | Maturing | Maturing
Scaling | Linear (shards and replicas) backed by inverted index | Limited by Hadoop infrastructure (table scans) | Good, but need to benchmark | Memory intensive; scale out using Spark cluster, backed by RDDs
Integration w/ external systems | JDBC stream source | External tables / plugin API | Many drivers available | DataSource API, many systems supported
Future Work
• Alternate graph traversal approaches, e.g. depth-first
• Possible support for Gremlin (graph traversal language from Apache TinkerPop)
• Additional graph algorithms (e.g. strongly connected components, PageRank)
Future Work
• No support for pushing >, >=, <, <= operators in WHERE clause down into Solr as range queries; use range syntax [4 TO *] for now
• Using Solr function queries in WHERE clause, e.g.
  WHERE location_p='{!geofilt d=90 pt=37.773972,-122.431297 sfield=location_p}'
• SQL Joins (SOLR-8593)
• Port SQL layer to use Apache Calcite vs. Presto
SQL: Current Limitations and Future Plans
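Until comparison operators are pushed down, a client has to spell ranges in Solr syntax itself. The helper below is a small sketch of that translation; the function names are hypothetical, and while square brackets (inclusive) and curly braces (exclusive) are standard Solr range notation, whether the SQL handler accepts the exclusive forms inside a quoted predicate is an assumption that should be checked against the Solr version in use.

```python
from typing import Union

Number = Union[int, float]


def to_solr_range(op: str, value: Number) -> str:
    """Translate a comparison operator into Solr range syntax."""
    ranges = {
        ">=": f"[{value} TO *]",    # inclusive lower bound
        ">": f"{{{value} TO *]",    # exclusive lower bound
        "<=": f"[* TO {value}]",    # inclusive upper bound
        "<": f"[* TO {value}}}",    # exclusive upper bound
    }
    if op not in ranges:
        raise ValueError(f"unsupported operator: {op}")
    return ranges[op]


def where_predicate(field: str, op: str, value: Number) -> str:
    """Build a predicate in the field='[range]' form used on the slides."""
    return f"{field}='{to_solr_range(op, value)}'"


predicate = where_predicate("rating_i", ">=", 4)
print(predicate)
print(where_predicate("age_i", "<", 30))
```

The first call reproduces the `rating_i='[4 TO *]'` predicate from the warm-up query.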
