OCTOBER 11-14, 2016 • BOSTON, MA
Your Big Data Stack is Too Big!
Timothy Potter
Apache Lucene/Solr Committer & PMC Member
Lucidworks
How we got here and where we’re going …
• Giving away 2 books ~ tweet including:
#FusionBigData #LuceneSolrRev
• A quick trip down memory lane …
Cassandra, Pig, Hive, HCatalog, HDFS, Mahout,
Sqoop, Oozie, Storm, and of course Solr!
• Big Data integration trap
• Lucidworks Fusion provides a viable alternative that
emphasizes fast access, agility, and automation
A few patterns emerge …
• Begins with need for better relevancy ~ automatically
• More and more mission-critical data lives in Fusion
• Much of big data is unstructured making search the ideal
exploration technology ~ people grok search
• Speed is addictive!
• But integrating search with big data is a non-trivial problem to solve ->
Fusion FTW!
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk
Data Ingest
• Connectors! Lots of them …
• Pipelines … because data ingest is messy
• JavaScript when you must!
• SparkSQL too! Replace DIH with SparkSQL JDBC
datasource: 31K docs / sec on a small Spark cluster
gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
• Spark Streaming to Solr too
Time-based Partitioning
• Docs partitioned into time-based collections in Solr
• New time partitions created on-the-fly when needed; older
partitions should age out automatically
• Need a document router to index docs in the correct collection
based on timestamp (doesn’t use aliases)
• Need a query router to read the appropriate collections based
on query time range
• Deeper analytics on larger historical time ranges achieved
using Spark by joining Solr with archived files stored in HDFS
• Check out the eventsim lab in the bootcamp
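The two routers described on this slide can be sketched in a few lines. This is only an illustration, not Fusion's implementation: the one-collection-per-UTC-day scheme, the `logs` prefix, and the function names `collection_for` and `collections_for_range` are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical naming scheme: one collection per UTC day, e.g. "logs_2016_10_11".
PREFIX = "logs"


def collection_for(ts, prefix=PREFIX):
    """Document router: name the daily collection that owns this timestamp."""
    day = ts.astimezone(timezone.utc).date()
    return f"{prefix}_{day:%Y_%m_%d}"


def collections_for_range(start, end, prefix=PREFIX):
    """Query router: list every daily collection overlapping [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{prefix}_{day:%Y_%m_%d}")
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    ts = datetime(2016, 10, 11, 23, 30, tzinfo=timezone.utc)
    print(collection_for(ts))
    print(collections_for_range(ts, ts + timedelta(hours=1)))
```

A query spanning midnight touches two collections here, which is the case the query router exists to handle.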
Common access patterns with big data
• Big data systems have grown complex trying to satisfy a
variety of access patterns
• Fast primary key lookups / atomic updates (Solr,
HBase, Cassandra, …)
• Low-latency ranked retrieval and facet-driven
discovery (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig,
Cassandra, Hive, Impala, …)
• Graph traversal (GraphX, Giraph, Neo4j, …)
Solr Streaming Inside
• Relies on docValues (column-oriented data
structure) and /export handler
• Extreme read performance (8-10x faster than
queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes”
independently of the data collection)
• Data tier (Solr collection)
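As a rough illustration of the data tier, the sketch below builds a request URL for Solr's /export handler, which expects `q`, `fl`, and `sort` parameters and needs docValues on the exported fields. The host, port, and the helper name `export_url` are assumptions for the example; the collection and field names are borrowed from the movielens lab.

```python
from urllib.parse import urlencode


def export_url(base_url, collection, query, fields, sort):
    """Build a /export request URL; the handler needs both fl and sort."""
    if not fields or not sort:
        raise ValueError("/export needs both a field list and a sort")
    params = {"q": query, "fl": ",".join(fields), "sort": sort}
    return f"{base_url.rstrip('/')}/{collection}/export?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed local Solr address; adjust for a real cluster.
    print(export_url("http://localhost:8983/solr", "movielens", "*:*",
                     ["user_id_i", "movie_id_i"], "movie_id_i asc"))
```

Because /export streams the full sorted result set rather than paging, the caller reads it as one continuous stream instead of repeating cursorMark requests.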
Fusion Signals for Relevance
• Simple DSL for aggregating user interactions
with search results, quite useful for boosting &
recommendations
• Scale using Spark
• Take user activity and feed it back into the search
engine to improve relevancy using Fusion query
pipelines
• Integrated with Lucidworks View to capture user
activity
• Custom logic via JavaScript … don’t get bogged
down in the weeds of Spark
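The idea of aggregating signals into boosts can be sketched in plain Python. This is not Fusion's aggregation DSL: the signal types, the weights in `WEIGHTS`, and the functions `aggregate_signals` and `boosts_for` are hypothetical and exist only to show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical weights per signal type; real values would be tuned.
WEIGHTS = {"click": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_signals(signals):
    """Sum weighted interactions per (query, doc_id) pair."""
    totals = defaultdict(float)
    for s in signals:
        totals[(s["query"], s["doc_id"])] += WEIGHTS.get(s["type"], 0.0)
    return dict(totals)


def boosts_for(query, aggregated, top_n=3):
    """Return the top_n (doc_id, weight) pairs to boost for a query."""
    hits = [(doc, w) for (q, doc), w in aggregated.items() if q == query]
    # Highest weight first; ties broken by doc_id for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))[:top_n]


if __name__ == "__main__":
    raw = [
        {"query": "ipad", "doc_id": "d1", "type": "click"},
        {"query": "ipad", "doc_id": "d1", "type": "click"},
        {"query": "ipad", "doc_id": "d2", "type": "purchase"},
    ]
    print(boosts_for("ipad", aggregate_signals(raw)))
```

At query time, a pipeline stage could look up the aggregated weights for the incoming query and add them as boosts.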
Self-service Analytics
• Can’t overstate the importance of SQL in big data
• Shortage of data scientists and engineers, abundance of
SQL-savvy business analysts
• JDBC-compliant tools abound!
• De-normalization is inconvenient
• Apache Zeppelin for exploring data in Solr and other data
sources
Best of Both Worlds: Spark SQL and Solr SQL
• Spark SQL provides an amazing query plan optimizer
with SQL:2003 support
• BUT … Spark SQL can’t compete with Solr performance
for queries that can be expressed in Solr
• Push-down aggregations into the engine!
• spark-solr tries to detect when sub-queries can be
pushed down into Solr
• movielens lab in fusion-spark-bootcamp
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Fusion Catalog API
• REST API for CRUD on data assets: views, tables, UDFs, etc.
• Full-text search for business analysts to find data sets of
interest
• Tool for SMEs to share complex data sets as simple views
• Authn & Authz via Fusion security
• Seamless integration with SparkSQL, streaming expressions,
parallel SQL, and JDBC
Example streaming expression (right edge clipped on the slide):

    parallel(workers,
      hashJoin(
        search(movielens, q=*:*,
               fl="user_id_i,movie_i
               sort="movie_id_i asc"
               partitionKeys="movie_
        hashed=search(movielens_movies
               fl="movie_id
               sort="movie_
               partitionKey
        on="movie_id_i"
      ),
      workers="4",
      sort="movie_id_i asc"
    )
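To show what the hashJoin and partitionKeys parts of that expression do conceptually, here is a small in-memory sketch. It is not Solr's implementation: the functions `hash_join` and `partition`, the `title_s` field, and the choice to let stream-side fields win on a name clash are assumptions for the example.

```python
def hash_join(stream, hashed, on):
    """Join each tuple from `stream` with matching tuples from `hashed`."""
    # Build phase: load the (smaller) hashed side into memory, keyed by `on`.
    table = {}
    for row in hashed:
        table.setdefault(row[on], []).append(row)
    # Probe phase: stream the other side and emit merged tuples on a key match.
    for row in stream:
        for match in table.get(row[on], []):
            yield {**match, **row}


def partition(rows, key, workers):
    """Assign rows to workers by hashing the partition key."""
    buckets = [[] for _ in range(workers)]
    for row in rows:
        buckets[hash(row[key]) % workers].append(row)
    return buckets


if __name__ == "__main__":
    ratings = [
        {"user_id_i": 1, "movie_id_i": 10},
        {"user_id_i": 2, "movie_id_i": 11},
    ]
    movies = [{"movie_id_i": 10, "title_s": "A"}]  # title_s is hypothetical
    for joined in hash_join(ratings, movies, on="movie_id_i"):
        print(joined)
```

Partitioning both sides on the join key means every worker sees all rows for the keys it owns, so each worker can run its join independently.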
  
Custom Script Jobs
• Not limited by our built-in toolset
• Develop a custom Spark script in Scala and then
upload it to Fusion to be scheduled and run on a Spark
cluster
• Focus more on solving business problems vs. ops /
job mgmt
• See the apachelogs example in the fusion-spark-bootcamp:
sessionize using a window function and then
compute aggregations for each session
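The sessionize-then-aggregate pattern can be sketched without Spark. This is not the bootcamp's Scala script: the 30-minute gap, the event fields `client` and `ts` (seconds), and the function names are assumptions. Comparing each event with the previous one for the same client plays the role that a lag over a window partitioned by client would play in Spark SQL.

```python
from collections import defaultdict

SESSION_GAP_SECS = 30 * 60  # hypothetical inactivity threshold


def sessionize(events, gap=SESSION_GAP_SECS):
    """Number sessions per client; a pause longer than `gap` starts a new one."""
    by_client = defaultdict(list)
    for e in events:
        by_client[e["client"]].append(e)
    out = []
    for rows in by_client.values():
        rows.sort(key=lambda e: e["ts"])
        session, prev_ts = 0, None
        for e in rows:
            if prev_ts is not None and e["ts"] - prev_ts > gap:
                session += 1
            out.append({**e, "session": session})
            prev_ts = e["ts"]
    return out


def session_stats(sessionized):
    """Aggregate request count and duration per (client, session)."""
    spans = {}
    for e in sessionized:
        key = (e["client"], e["session"])
        first, last, n = spans.get(key, (e["ts"], e["ts"], 0))
        spans[key] = (min(first, e["ts"]), max(last, e["ts"]), n + 1)
    return {k: {"requests": n, "duration": last - first}
            for k, (first, last, n) in spans.items()}


if __name__ == "__main__":
    log = [{"client": "a", "ts": 0}, {"client": "a", "ts": 600},
           {"client": "a", "ts": 4000}, {"client": "b", "ts": 100}]
    print(session_stats(sessionize(log)))
```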
Data science in a box
• REPL with hooks into Solr for quickly exploring
unstructured data sets
• Jake's RecSys recipe for building recommender
systems
• Full access to Lucene text analyzers when building
ML pipelines
• See mlsvm & ml20news labs in the fusion-spark-bootcamp
• searchhub.lucidworks.com
see slides from Grant’s talk about SearchHub
Machine Learning in Index & Query Pipelines
• Query intent
• Document classification
• Recommendations
• Design / evaluate / refine models in
Spark ML pipelines or MLlib and then
publish to Fusion to generate
predictions from query / index
pipelines
ID of model stored in Fusion’s blob store
Field to store model prediction in each document during indexing
Example: Sentiment Classifier during Indexing
You could do this yourself …
• It’s too easy to fall back into the trap of thinking
that hard work getting cool technologies working
together equates to business value.
• Get back to focusing on solving business
problems ~ increased ROI, faster
• Fusion gives you a clear buy vs. build choice
Figure: Fusion Architecture (diagram). Labels: Millions of Users; Proxy; REST; Security woven throughout; Optional; Recs; Worker; Pipes; Metrics; NLP; Sched.; Blobs; Admin; Connectors; Signals; Spark (Worker, Cluster Mgr.); Solr (Shards); HDFS; Zookeeper (ZK 1 … ZK N); Shared Config Mgmt; Leader Election; Load Balancing; Billions of Docs.
Thanks! Q & A
• Try Fusion: https://lucidworks.com/products/fusion/download/
• spark-solr: http://bit.ly/1Ub12GU
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk
• 40% off Manning books coupon code: ctwlucsoltw
