SEARCHING BILLIONS OF PRODUCT LOGS IN REALTIME
Ryan Tabora - Think Big Analytics
NoSQL Search Roadshow - June 6, 2013
WHO AM I?
Ryan Tabora
Think Big Analytics - Senior Data Engineer
Lover of dachshunds, bass, and zombies
OVERVIEW
Primers
What are product logs?
How do they apply to big data?
Real use case
Real issues and designs
Conclusion
PRODUCT LOGS?
• Device data
• IT, Energy, Healthcare, Manufacturing, Telecom ...
• These devices are pushing data back home (pull works too!)
• As more devices are sold/installed, more and more data
comes back to ‘home base’
POWER OF DEVICE DATA
• Realtime Visualization
• Realtime Response
• Ad Hoc Analysis
• Full Historical Capture
• Blended Data Sets
TRADITIONAL APPROACHES
• SQL: PostgreSQL, MySQL, Oracle, Microsoft SQL Server
• SQL provides many of the search features required for typical
search applications
• Joins, regex, group by, sorting, etc.
• But these technologies can only scale so far...
NEW TECHNIQUES
STORING DATA
• Hadoop
• HBase/Cassandra/Accumulo
• Search features are very limited
• HBase row scans, primary key index
• Cassandra limited secondary indexing
NEW TECHNIQUES
INDEXING DATA
• What is an index?
• Lucene
• Parallelizing index creation
• MapReduce/Flume/Storm
• Realtime search
• Searching before it hits disk
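As a rough illustration of the "What is an index?" question, here is a toy inverted index in Python. It is a sketch of the general idea only, not how Lucene stores or searches its index; the log lines and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index, *terms):
    """Return ids of documents that contain every term (boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical log lines, keyed by document id.
logs = {
    1: "WARNING disk dead",
    2: "INFO disk healthy",
    3: "WARNING fan failure",
}
index = build_inverted_index(logs)
matches = search_all(index, "warning", "disk")
```

A lookup touches only the posting sets for the query terms instead of scanning every log line, which is the property that makes search over very large log volumes feasible.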
NEW TECHNIQUES
SEARCHING DATA
• Solr/ElasticSearch
• Both built on top of Lucene
• Search servers
• RESTful HTTP APIs
• Easy to administer
• Add powerful text/numerical search capabilities
BASIC SEARCH FEATURES
• Boolean logic (AND, OR, +, -)
• Sorting and Group By
• Range queries
• Phrase/Prefix/Fuzzy queries
ADVANCED SEARCH
FEATURES
• Custom ranking/scoring
• More like this
• Auto suggest
• Faceting/Highlighting
• Geospatial search
SCALING SEARCH
• ElasticSearch and SolrCloud both have distributed features
built in
• Auto-sharding
• Replication
• Query routing
• Transaction log
USE CASE
Problem
Sample Solution
Core Design Issues
Other Solutions
THE PROBLEM
[Diagram: NetApp filer devices at Client A, Client B, and Client C each send logs to Home Base. Access paths shown: REST API, Full SQL Access, Flat File Access (labels: Latest, All). Consumers: Applications; Engineers & Analysts.]
SEARCH APPLICATION
FEATURES
• Find last three days of raw logs from an entire cluster
• Group available capacity by machine serial number and show
the largest capacities first
• Search all device header lines for “FAILURE”
• View all hard disk objects that have product number 2341AB
• Find all motherboards with an associated customer ticket
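To make a few of these features concrete, the sketch below builds Solr select URLs for them using the standard query syntax (field:value terms, AND, range queries with date math, sort, and result grouping). The host, core name, and the fields serial_number, capacity_available, object_type, and product_number are hypothetical; cluster_id, date_sent, and header are taken from the Solr document slide, and date_sent is assumed to be indexed as a Solr date field.

```python
from urllib.parse import urlencode


def select_url(core_url, params):
    """Build a Solr /select URL from a dict of query parameters."""
    return f"{core_url}/select?{urlencode(params)}"


CORE = "http://localhost:8983/solr/logs"  # hypothetical host and core name

# Last three days of logs from one cluster (range query with date math).
recent_cluster_logs = select_url(CORE, {
    "q": 'cluster_id:"1333-2241-3411" AND date_sent:[NOW-3DAYS TO NOW]',
})

# Device header lines that mention FAILURE.
failed_headers = select_url(CORE, {"q": "header:FAILURE"})

# Group by serial number, largest available capacity first.
by_serial = select_url(CORE, {
    "q": "*:*",
    "group": "true",
    "group.field": "serial_number",
    "sort": "capacity_available desc",
})

# Hard disk objects with a given product number.
disks = select_url(CORE, {
    "q": "object_type:disk AND product_number:2341AB",
    "sort": "capacity_available desc",
})
```

The last feature on the list (motherboards with an associated customer ticket) is a join in SQL terms, so it is left out here; a single flat query only covers it if ticket data is denormalized into the indexed documents.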
SAMPLE SOLUTION
[Diagram labels: Logs; Ingestion; Parsing/Loading; Indexing; Queries; Custom RESTful Search API; HDFS; MapReduce]
INGESTION
PARSING, LOADING, AND INDEXING
Load HBase with parsed objects
Store HBase ROW_ID
Store pointer to raw file in HDFS
Index a number of desired fields
[Diagram: sequence files /ingestion/sequencefile1, /ingestion/sequencefile2, and /ingestion/sequencefile3, each starting at offset 0, with record offsets shown as 1534, 4562, 5323, 7232; 4601, 5105; and 1492, 2987, 4767, 5987]
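The parsing step above can be sketched as a function that turns one parsed log record into a search document carrying the HBase row id, a pointer into the raw sequence file, and a chosen subset of fields. This is a minimal sketch under assumptions: the names RawPointer and make_index_document, the row key format, and the choice of indexed fields are all hypothetical, and plain dicts stand in for the real HBase and Solr clients.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawPointer:
    sequence_file: str  # HDFS path of the sequence file holding the raw log
    offset: int         # offset of the record inside that file


def make_index_document(rowkey, pointer, fields, indexed=("header", "date_sent")):
    """Build the search document: row id, raw-file pointer, and chosen fields."""
    doc = {
        "rowkey": rowkey,
        "sequenceFile": pointer.sequence_file,
        "sfOffset": pointer.offset,
    }
    # Only the fields selected for search are copied into the index.
    doc.update({name: fields[name] for name in indexed if name in fields})
    return doc


# Illustrative values only; the row key format is an assumption.
parsed = {
    "header": "WARNING: DISK DEAD",
    "date_sent": "2013/05/12",
    "contents": "full log body stays out of the index in this sketch",
}
doc = make_index_document(
    rowkey="example-row-1",
    pointer=RawPointer("/ingestion/sequencefile1", 1534),
    fields=parsed,
)
```

Keeping only locations and a few fields in the index keeps it small; the full object and the raw log are fetched from HBase and HDFS after a hit.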
INSIDE OF HBASE
[Table: one row per rowkey, with columns object1, object2, object3, object4, and object5; cell contents are elided on the slide]
THE SOLR DOCUMENT
Solr Document (field: sample value)
• sfOffset: 2343
• sequenceFile: /ingest/file2
• cluster_id: 1333-2241-3411
• system_id: 42ADFF-BZMM
• file_name: configs.log
• date_sent: 2013/05/12
• header: WARNING: DISK DEAD
• contents: ...
• rowkey: ...
SEARCH APPLICATION
[Diagram: numbered flow (steps 1-8) between the User, the Search Application, and the data stores. Arrow labels: Query; Data locations; Row_id; Stored Object; Raw Data Location; Raw Data; Query Results.]
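One plausible reading of this flow is: query the index, get back data locations, then fetch the stored object by row id and the raw data by its file location. The sketch below follows that reading with in-memory dicts standing in for Solr, HBase, and HDFS. The exact meaning of each numbered arrow is not recoverable from the slide text, so treat the ordering as an assumption; run_query and the sample data are hypothetical.

```python
def run_query(term, index_docs, hbase_rows, raw_store):
    """Resolve a header search into stored objects and raw log records."""
    results = []
    for doc in index_docs:
        if term.lower() not in doc["header"].lower():
            continue
        # The index hit carries only locations, not the data itself.
        stored = hbase_rows[doc["rowkey"]]
        raw = raw_store[(doc["sequenceFile"], doc["sfOffset"])]
        results.append({"rowkey": doc["rowkey"], "object": stored, "raw": raw})
    return results


# In-memory stand-ins for the index, HBase, and HDFS (illustrative data).
index_docs = [
    {"rowkey": "r1", "header": "WARNING: DISK DEAD",
     "sequenceFile": "/ingest/file2", "sfOffset": 2343},
    {"rowkey": "r2", "header": "INFO: ALL OK",
     "sequenceFile": "/ingest/file2", "sfOffset": 4601},
]
hbase_rows = {"r1": {"type": "disk"}, "r2": {"type": "fan"}}
raw_store = {
    ("/ingest/file2", 2343): "raw log A",
    ("/ingest/file2", 4601): "raw log B",
}
hits = run_query("disk dead", index_docs, hbase_rows, raw_store)
```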
CORE DESIGN ISSUES
• Changing the Solr schema (manual reindex)
• Elastic shard scaling (manual reindex)
• No distributed joining (denormalizing the data)
• Replication*
• Manually managing Solr partitioning/sharding*
• Write durability*
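The "elastic shard scaling (manual reindex)" issue can be illustrated with a simplified routing scheme. The sketch assumes documents are routed by a hash of their id modulo the shard count, which is a simplification chosen for the example and not a description of how Solr routes documents. Under that assumption, changing the shard count reassigns most documents, which is why the index has to be rebuilt.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        shard_for(d, old_shards) != shard_for(d, new_shards) for d in doc_ids
    )
    return moved / len(doc_ids)


# Hypothetical document ids.
doc_ids = [f"doc-{i}" for i in range(10_000)]
fraction = moved_fraction(doc_ids, 3, 4)
```

Going from 3 to 4 shards, a document keeps its shard only when its hash gives the same remainder for both counts, which for a uniform hash happens about a quarter of the time, so roughly three quarters of the documents move.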
SOLRCLOUD
• Automatic shard creation, routing
• Replication
• Limited to a fixed number of shards defined on initial creation
• ZooKeeper for coordination
• Large community
ELASTICSEARCH
• Similar feature set to Solr
• Purpose built for easily managing a distributed index
• Rapidly growing community
• Custom built coordination mechanism
• JSON based API
DATASTAX ENTERPRISE
• Integrates Cassandra and Solr
• Automatic indexing in Solr/storing in Cassandra
• Automatic partitioning
• Automatic reindexing
• Not limited to fixed number of shards
• Proprietary and costs money
CONCLUSION
• Collecting and analyzing device data/product logs can be a
very difficult challenge
• You can use NoSQL and search technologies like Solr or
ElasticSearch in unison...
• ...but it is not always easy to integrate search with NoSQL
QUESTIONS?
• Feel free to reach out if you have any questions or need help
with big data/search!
• http://ryantabora.com
• http://thinkbiganalytics.com
• @ryantabora
• ryan.tabora@thinkbiganalytics.com
BONUS SLIDES
HBASE AND SOLR
• Automatic partitioning/reindexing
• Automatic index updates on HBase inserts/deletes
• Mapping HBase cells to a Solr schema
• No perfect commercial/open source solution yet
• Many many many more...
HBASE + SOLR
AUTOMATIC INDEXING
• HBase coprocessors are like stored procedures/triggers
• New, powerful, and dangerous
• Triggers on HBase puts/deletes
• Mapping data to a schema?
HBASE + SOLR
WRITE DURABILITY
[Diagram components: MapReduce Indexing Application; HBase Table - SOLR_QUEUE; Solr Queue Reader; Solr Shard 1, Solr Shard 2, Solr Shard 3]
1. Create SolrDocument from raw ASUP
2. Add SolrDocument to HBase for durability
3. Get oldest SolrDocument from HBase Queue Table
4. Use custom hash algorithm to determine which shard to add SolrDocument to
5. Query Solr, if SolrDocument was added, then remove it from the SOLR_QUEUE
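The queue steps on this slide can be sketched with in-memory stand-ins. This is only an outline of the pattern (persist first, index second, dequeue after confirmation): an ordered dict stands in for the SOLR_QUEUE table, plain dicts stand in for the Solr shards, and MD5 stands in for the custom hash algorithm, which the slide does not specify. The class and method names are hypothetical.

```python
import hashlib
from collections import OrderedDict


class QueueReader:
    """In-memory stand-in for the Solr Queue Reader shown on the slide."""

    def __init__(self, num_shards):
        self.queue = OrderedDict()  # stand-in for the HBase SOLR_QUEUE table
        self.shards = [dict() for _ in range(num_shards)]  # stand-in Solr shards

    def enqueue(self, doc_id, doc):
        # Persist the document in the queue before any shard sees it.
        self.queue[doc_id] = doc

    def shard_index(self, doc_id):
        # A stable hash picks the shard (MD5 is an assumption here).
        digest = hashlib.md5(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def drain_one(self):
        """Index the oldest queued document; dequeue only after confirming it."""
        if not self.queue:
            return None
        doc_id, doc = next(iter(self.queue.items()))  # oldest entry first
        shard = self.shards[self.shard_index(doc_id)]
        shard[doc_id] = doc  # add the document to the chosen shard
        if doc_id in shard:  # confirm the add, then remove from the queue
            del self.queue[doc_id]
        return doc_id


reader = QueueReader(num_shards=3)
reader.enqueue("a", {"header": "WARNING: DISK DEAD"})
reader.enqueue("b", {"header": "INFO: ALL OK"})
first = reader.drain_one()
```

If the process dies between the add and the confirmation, the document is still in the queue and gets indexed again on the next pass, so adds need to be idempotent (for example, keyed by document id).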
HBASE + SOLR
ELASTIC SHARDING
• HBase’s distribution mechanism uses the concept of regions to
split data across many nodes
• Region splitting can be automatic or manual (performance
degradation as regions split)
• Piggybacking Solr sharding on HBase Region splitting
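The piggybacking idea can be sketched as a routing table that mirrors region boundaries: each region covers a sorted range of row keys, and each region gets its own shard. This is a toy model under assumptions (the class name, one shard per region, and string row keys are all hypothetical) and it ignores moving the already indexed documents when a split happens.

```python
from bisect import bisect_right


class RegionShardMap:
    """Route row keys to shards that mirror sorted, range-based regions."""

    def __init__(self):
        self.split_keys = []  # sorted boundaries; n boundaries define n + 1 regions

    def shard_for(self, rowkey):
        # Region i holds keys from split_keys[i - 1] (inclusive)
        # up to split_keys[i] (exclusive).
        return bisect_right(self.split_keys, rowkey)

    def split(self, split_key):
        """Record a region split; the shard covering split_key splits with it."""
        if split_key not in self.split_keys:
            self.split_keys.append(split_key)
            self.split_keys.sort()


regions = RegionShardMap()
regions.split("m")
before = regions.shard_for("g")
regions.split("f")
after = regions.shard_for("g")
```

Because ranges are contiguous, a split only affects documents in the region that split, unlike a change to a modulo-based shard count. Note that in this toy model shard numbers after the split point shift by one, so a real mapping would need stable shard identifiers.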
