Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks

October 11-14, 2016 • Boston, MA

Your Big Data Stack is Too Big!
Timothy Potter
Apache Lucene/Solr Committer & PMC Member
Lucidworks

How we got here and where we’re going …
• Giving away 2 books ~ tweet including:
#FusionBigData #LuceneSolrRev
• A quick trip down memory lane …
Cassandra, Pig, Hive, HCatalog, HDFS, Mahout,
Sqoop, Oozie, Storm, and of course Solr!
• Big Data integration trap
• Lucidworks Fusion provides a viable alternative that
emphasizes fast access, agility, and automation

A few patterns emerge …
• Begins with need for better relevancy ~ automatically
• More and more mission-critical data lives in Fusion
• Much of big data is unstructured, making search the ideal
exploration technology ~ people grok search
• Speed is addictive!
• But integrating search with big data is a non-trivial problem to
solve -> Fusion FTW!
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk

Data Ingest
• Connectors! Lots of them …
• Pipelines … because data ingest is messy
• JavaScript when you must!
• SparkSQL too! Replace DIH with SparkSQL JDBC
datasource: 31K docs / sec on a small Spark cluster
gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
• Spark Streaming to Solr too

Time-based Partitioning
• Docs partitioned into time-based collections in Solr
• New time partitions created on-the-fly when needed; older
partitions should age out automatically
• Need a document router to index docs in the correct collection
based on timestamp (doesn’t use aliases)
• Need a query router to read the appropriate collections based
on query time range
• Deeper analytics on larger historical time ranges achieved
using Spark by joining Solr with archived files stored in HDFS
• Check out the eventsim lab in the bootcamp
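
The document router, query router, and age-out rules above can be sketched in a few lines of Python. This is a minimal illustration, not Fusion's implementation: the daily granularity, the `logs_YYYY_MM_DD` naming scheme, and all function names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed naming scheme: one collection per UTC day, e.g. "logs_2016_10_11".
def collection_for(base, ts):
    """Document router: pick the collection that owns this timestamp."""
    return f"{base}_{ts.astimezone(timezone.utc):%Y_%m_%d}"

def collections_for_range(base, start, end):
    """Query router: list every daily collection overlapping [start, end]."""
    day = start.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    last = end.astimezone(timezone.utc)
    names = []
    while day <= last:
        names.append(collection_for(base, day))
        day += timedelta(days=1)
    return names

def expired(names, base, now, keep_days):
    """Age-out: collections older than the retention window.

    Zero-padded dates sort lexicographically, so a string compare is enough.
    """
    cutoff = collection_for(base, now - timedelta(days=keep_days))
    return [n for n in names if n < cutoff]
```

A query for the last 36 hours would then fan out to two or three collections instead of the whole history, which is the point of the query router.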

Common access patterns with big data
• Big data systems have grown complex trying to satisfy a
variety of access patterns
• Fast primary key lookups / atomic updates (Solr,
HBase, Cassandra, …)
• Low-latency ranked retrieval and facet-driven
discovery (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig,
Cassandra, Hive, Impala, …)
• Graph traversal (GraphX, Giraph, Neo4j, …)

Solr Streaming Inside
• Relies on docValues (column-oriented data
structure) and /export handler
• Extreme read performance (8-10x faster than
queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes”
independently of the data collection)
• Data tier (Solr collection)
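
To make the map/reduce-style aggregation mode concrete, here is a rough Python sketch of a rollup over a stream that is already sorted by the grouping field, as an exported, sorted stream would be. The function name and the `rating_i` field are illustrative assumptions; this is not Solr code.

```python
from itertools import groupby

def rollup(sorted_tuples, over, field):
    """Aggregate in one pass over tuples already sorted by `over`.

    Because the input is sorted, each group is contiguous, so only one
    group has to be held in memory at a time.
    """
    for key, group in groupby(sorted_tuples, key=lambda t: t[over]):
        values = [t[field] for t in group]
        yield {over: key, "count": len(values), "sum": sum(values)}
```

If the tuples are partitioned across workers by the same key, no group spans two workers, so the per-worker results can simply be merged.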

Fusion Signals for Relevance
• Simple DSL for aggregating user interactions
with search results, quite useful for boosting &
recommendations
• Scale using Spark
• Take user activity and feed it back into the search
engine to improve relevancy using Fusion query
pipelines
• Integrated with Lucidworks View to capture user
activity
• Custom logic via JavaScript … don’t get bogged
down in the weeds of Spark
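
As a rough illustration of the signal feedback loop, the sketch below aggregates raw user events into a weight per (query, document) pair and then picks the documents to boost for a query. It is plain Python, not the Fusion signals DSL; the signal types, the weights, and both function names are assumptions for the example.

```python
from collections import defaultdict

# Assumed weights per signal type; a real deployment would tune these.
WEIGHTS = {"click": 1.0, "cart": 3.0, "purchase": 5.0}

def aggregate_signals(signals):
    """Roll raw (query, doc_id, type) events up into a weight per (query, doc_id)."""
    totals = defaultdict(float)
    for query, doc_id, signal_type in signals:
        # Normalize the query so "iPad" and "ipad " aggregate together.
        totals[(query.strip().lower(), doc_id)] += WEIGHTS.get(signal_type, 0.0)
    return dict(totals)

def boosts_for(query, aggregated, top_n=3):
    """Return the top-N (doc_id, weight) pairs to boost for this query."""
    q = query.strip().lower()
    hits = [(doc, w) for (aq, doc), w in aggregated.items() if aq == q]
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:top_n]
```

A query pipeline stage could turn the returned pairs into boost parameters on the outgoing query.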

Self-service Analytics
• Can’t overstate the importance of SQL in big data
• Shortage of data scientists and engineers, abundance of
SQL-savvy business analysts
• JDBC-compliant tools abound!
• De-normalization is inconvenient
• Apache Zeppelin for exploring data in Solr and other data
sources

Best of Both Worlds: Spark SQL and Solr SQL
• Spark SQL provides an amazing query plan optimizer
with SQL2003 support
• BUT … Spark SQL can’t compete with Solr performance
for queries that can be expressed in Solr
• Push-down aggregations into the engine!
• spark-solr tries to detect when sub-queries can be
pushed down into Solr
• movielens lab in fusion-spark-bootcamp
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Fusion Catalog API
• REST API for CRUD on data assets: views, tables, UDFs, etc
• Full-text search for business analysts to find data sets of
interest
• Tool for SMEs to share complex data sets as simple views
• Authn & Authz via Fusion security
• Seamless integration with SparkSQL, streaming expressions,
parallel SQL, and JDBC
parallel(workers,
  hashJoin(
    search(movielens,
      q=*:*,
      fl="user_id_i,movie_i…
      sort="movie_id_i asc"
      partitionKeys="movie_…
    hashed=search(movielens_movies…
      fl="movie_id…
      sort="movie_…
      partitionKey…
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)
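
For readers unfamiliar with hash joins, the following Python sketch shows the general technique the expression above relies on: build an in-memory index from the `hashed` stream, then probe it while streaming the other side. It illustrates the idea only and is not Solr's implementation; `rating_i` and `title_s` in the usage note are made-up fields, and letting the streamed row win on field name collisions is a choice made for this sketch.

```python
def hash_join(left, hashed, on):
    """Join two tuple streams on `on`, holding only the `hashed` side in memory."""
    index = {}
    for row in hashed:  # build phase: read the (smaller) hashed stream fully
        index.setdefault(row[on], []).append(row)
    for row in left:  # probe phase: stream the larger side once
        for match in index.get(row[on], []):
            yield {**match, **row}  # streamed row wins on name collisions

def partition(rows, key, workers):
    """Split rows across workers by hashing the partition key."""
    buckets = [[] for _ in range(workers)]
    for row in rows:
        buckets[hash(row[key]) % workers].append(row)
    return buckets
```

Partitioning both streams on the join key means each worker can join its own slice independently, e.g. `hash_join(ratings_part, movies_part, "movie_id_i")` per worker, with the results merged afterwards.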

Custom Script Jobs
• Not limited by our built-in toolset
• Develop a custom Spark script in Scala and then
upload it to Fusion to be scheduled and run on a Spark
cluster
• Focus more on solving business problems vs. ops /
job mgmt
• See apachelogs example in the fusion-spark-bootcamp:
sessionize using window function and then
compute aggregations for each session
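
The sessionize-then-aggregate idea can be sketched without Spark. The Python below mimics what a lag-style window function partitioned by user would compute; it is not the bootcamp's Scala script. The 30-minute inactivity timeout and the `(user, epoch_secs, bytes)` event shape are assumptions for the example.

```python
from itertools import groupby

SESSION_GAP_SECS = 30 * 60  # assumed inactivity timeout; not stated in the slides

def sessionize(events):
    """Assign a session number to each (user, epoch_secs, nbytes) event."""
    out = []
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        session, prev_ts = 0, None
        for _, ts, nbytes in user_events:
            # Like lag() over a per-user window: a gap larger than the
            # timeout since the previous event starts a new session.
            if prev_ts is not None and ts - prev_ts > SESSION_GAP_SECS:
                session += 1
            out.append((user, session, ts, nbytes))
            prev_ts = ts
    return out

def session_stats(sessionized):
    """Aggregate per (user, session): request count, duration, total bytes."""
    stats = {}
    for user, session, ts, nbytes in sessionized:
        s = stats.setdefault(
            (user, session), {"requests": 0, "start": ts, "end": ts, "bytes": 0})
        s["requests"] += 1
        s["end"] = ts  # input is time-ordered within each user
        s["bytes"] += nbytes
    return {
        k: {"requests": v["requests"],
            "duration": v["end"] - v["start"],
            "bytes": v["bytes"]}
        for k, v in stats.items()
    }
```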

Data science in a box
• REPL with hooks into Solr for quickly exploring
unstructured data sets
• Jake's RecSys recipe for building recommender
systems
• Full access to Lucene text analyzers when building
ML pipelines
• See mlsvm & ml20news labs in the fusion-spark-bootcamp
• searchhub.lucidworks.com
(see slides from Grant’s talk about SearchHub)

Machine Learning in Index & Query Pipelines
• Query intent
• Document classification
• Recommendations
• Design / evaluate / refine models in
Spark ML pipelines or MLlib and then
publish to Fusion to generate
predictions from query / index
pipelines

ID of model stored in Fusion’s blob store
Field to store model prediction in each document during indexing
Example: Sentiment Classifier during Indexing
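
The slide itself is a screenshot, so the following is only a conceptual Python sketch of what such an index stage does: look up a model by ID, run it on one field, and store the prediction in another. Everything here is a hypothetical stand-in: a dict plays the blob store, a keyword counter plays the trained model, and the field names and model ID are invented.

```python
# Hypothetical stand-ins: a dict plays the role of the blob store, and a
# keyword counter plays the role of a trained Spark ML sentiment model.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "broken", "hate"}

def toy_sentiment_model(text):
    """Classify text by counting positive vs. negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

BLOB_STORE = {"sentiment-v1": toy_sentiment_model}

def ml_index_stage(doc, model_id, input_field, prediction_field):
    """Look up the model by ID and write its prediction onto a copy of the doc."""
    model = BLOB_STORE[model_id]
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[prediction_field] = model(doc.get(input_field, ""))
    return enriched
```

The two stage parameters mirror the two callouts on the slide: which model to load, and which document field receives the prediction.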

You could do this yourself …
• It’s too easy to fall back into the trap of thinking
that the hard work of getting cool technologies working
together equates to business value.
• Get back to focusing on solving business
problems ~ increased ROI, faster
• Fusion gives you a clear buy vs. build choice

[Figure: Fusion Architecture. Labels in the diagram: Millions of Users; REST; Proxy; Security woven throughout; Connectors; Pipes; Signals; Recs; NLP; Sched.; Blobs; Metrics; Admin; Spark (Cluster Mgr., Worker, Worker); Solr (Shards, Shards); HDFS; Optional; Billions of Docs; ZooKeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing]

Thanks! Q & A
• Try Fusion: https://lucidworks.com/products/fusion/download/
• spark-solr: http://bit.ly/1Ub12GU
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk
• 40% off Manning books coupon code: ctwlucsoltw
