Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr

BUILD A SUPERCHARGED DATAMART
WITH SOLR
Elliott Cordo
Chief Architect, Caserta Concepts

What is Solr
• Solr is an open source search platform
• Based on Core Lucene search technology
• Bundled up with an API, Management Tools, UI, scalability

But isn’t it just a search engine
• Although Solr was primarily architected as a search
engine, there is no reason you can’t use it as a database
• Search based application movement promotes search
engine as a data store
• Search has long been a “cheap” option fast and
interactive queries NoSQL and Hadoop datastores

So why would we use it
• Solr is fast –
• expect low ms response times on simple lookups
• properly tuned even complex queries will take less than 100ms
• Solr scales
• High concurrency
• Scales horizontally and vertically (larger hardware)

And it has the best query flexibility of any
NOSQL store!
..and in many cases RDBMS
• Grouping and Aggregation via Facets
• Fuzzy Search
• Equality and Range queries
• Geospatial capabilities
• HIGHLY extensible!

Another Datastore to Manage??
• Polygot persistence/polygot programming
• Feature/function will drive which technology should be
used
• Use the right tool for the job: Relational, MPP, Hadoop,
Graph, KV, NOSQL
Thankfully Solr is pretty easy to learn and manage!

When it works well
• Search is front and center  end users need to fuzzy
search dimensional attributes
• Flexible /Sparse schema
• Need for speed -> faster queries for more user
engagement
• Concurrency -> fast queries on client facing or open web

Use Cases
• Real time analytics  ingest incomming events from
Flume/Logstash/Custom app
• Supplement NOSQL, MPP, or Hadoop analytics
• Web facing analytics DB

What it doesn’t do well
• Joins across collections/cores (tables)
• Complex arbitrary queries
• Limited integration to standard ETL and BI frameworks

How do you get data in?
• A robust API
• Modules and libraries for just about any programing
language
• Index data in any DB via JDBC
• Pull in XML and Delimited files with Simple Posting Tool
• Flume/Logstash
NOTE: that like it NOSQL cousins, data needs to be
Flattened!

How do you interact with Solr?
HTTP, Nice concise query language
http://localhost:8983/solr/collection1/select?q=city:Yuma&wt=json&indent=true
And what does the response look like:
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"indent":"true",
"q":"city:Yuma",
"wt":"json"}},
"response":{"numFound":6,"start":0,"docs":[
{
"review_id":"JhUliQTD9iyGWov2nv-ZJA",
"stars":2,
"review_date":"2009-09-23T00:00:00Z",
"business_id":"gKRUdbTPBZ7kwBRCeZDDWA",
"business_name":"Wingate By Wyndham",
"city":"Yuma",
"state":"AZ",
"longitude":"-112.09343969999999",
"latitude":"33.434925100000001",
"user_id":"AqlZdDD7NK1fpQi9ltqIXQ",
"user_name":"Studl",
"_version_":1475098783569149953},

So what about analytic queries
select city, count(1)
from reviews
where state=‘AZ’
group by city
http://localhost:8983/solr/collection1/select?q=state:AZ&wt=json&indent
=true&facet=true&facet.field=city&rows=0&facet.mincount=1&facet.limit
=-1

More query fun
select city, count(1)
from reviews
where state=‘AZ’
and review_date between ‘2012-03-01’ and ‘2012-03-06’
group by city
having count(1)>=20
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%
5B2012-03-01T23:59:59.999Z TO 2012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&facet.field=city&r
ows=0&facet.mincount=20&facet.limit=-1

Facet stats give you aggregation
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%5B2012-
03-01T23:59:59.999Z%20TO%202012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&rows=0&facet.mincount=
1&facet.limit=-1&stats=true&stats.field=stars&stats.facet=city
"stats":{
"stats_fields":{
"stars":{
"min":1.0,
"max":5.0,
"count":991,
"missing":0,
"sum":3685.0,
"sumOfSquares":15313.0,
"mean":3.7184661957618568,
"stddev":1.2754290498612053,
"facets":{
"city":{
"Peoria":{
"min":1.0,
"max":5.0,
"count":14,
"missing":0,
"sum":54.0,
"sumOfSquares":234.0,
"mean":3.857142857142857,
"stddev":1.4064216928154862,
"facets":{}},
"Goodyear":{
"min":2.0,
"max":5.0,
"count":7,
….

Facet pivots too!
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%5B2012-
03-01T23:59:59.999Z%20TO%202012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&rows=0&facet.mincount=
1&facet.limit=-1&facet.pivot=city,business_name
"facet_pivot":{
"city,business_name":[{
"field":"city",
"value":"Anthem",
"count":3,
"pivot":[{
"field":"business_name",
"value":"Outlets At Anthem",
"count":1},
{
"value":"Q to U BBQ",
"count":1},
{
"value":"Shanghai Club",
"count":1}]},
{
"field":"city",
"value":"Apache Junction",
"count":1,
"pivot":[{
"value":"Lost Dutchman State Park",
"count":1}]},

UI please!
• Roll your own  it’s not that hard
• How about using Python Flask to render a Solr Response
to D3 or Google Charts
• Sometimes a custom solution is the best option

And the really easy way
Banana – A Solr port of Kibana!
Why should Elasticache fans have all the fun?
And it’s open source!
https://github.com/LucidWorks/banana

Banana
• An AngularJS app (pure javascript, runs in any
browser)
• Make a pretty dashboard with no development in a
couple minutes
• Very user friendly, users can create their own
content

https://github.com/Caserta-Concepts/Solr-Datamart
elliott@casertaconcepts.com

Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr

More Related Content

What's hot (20)

Similar to Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr (20)

More from Caserta (20)

Recently uploaded (20)

Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr