Searching Billions of Product Logs in Real Time (Use Case)

Ryan Tabora - Think Big Analytics
NoSQL Search Roadshow - June 6, 2013

WHO AM I?
Ryan Tabora
Think Big Analytics - Senior Data Engineer
Lover of dachshunds, bass, and zombies

OVERVIEW
Primers
What are product logs?
How do they apply to big data?
Real use case
Real issues and designs
Conclusion

PRODUCT LOGS?
• Device data
• IT, Energy, Healthcare, Manufacturing, Telecom ...
• These devices are pushing data back home (pull works too!)
• As more devices are sold/installed, more and more data comes back to 'home base'

POWER OF DEVICE DATA
• Realtime Visualization
• Realtime Response
• Ad Hoc Analysis
• Full Historical Capture
• Blended Data Sets

TRADITIONAL APPROACHES
• SQL: PostgreSQL, MySQL, Oracle, Microsoft SQL Server
• SQL provides many of the search features required for typical search applications
• Joins, regex, group by, sorting, etc.
• But these technologies can only scale so far...

NEW TECHNIQUES: STORING DATA
• Hadoop
• HBase/Cassandra/Accumulo
• Search features are very limited
• HBase row scans, primary key index
• Cassandra limited secondary indexing

NEW TECHNIQUES: INDEXING DATA
• What is an index?
• Lucene
• Parallelizing Index Creation
• MapReduce/Flume/Storm
• Realtime Search
• Searching before it hits disk
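
To make "What is an index?" concrete, here is a minimal sketch of an inverted index, the core idea behind Lucene. It is a toy illustration only: the function name and sample documents are made up, and a real Lucene index also stores term positions, frequencies, and much more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical log lines keyed by document ID.
docs = {1: "WARNING disk dead", 2: "disk healthy", 3: "fan WARNING"}
index = build_inverted_index(docs)

print(index["disk"])     # IDs of documents containing "disk"
print(index["warning"])  # terms are lowercased before indexing
```

Looking up a term is then a dictionary access instead of a scan over every log line, which is why index creation is worth parallelizing.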

NEW TECHNIQUES: SEARCHING DATA
• Solr/ElasticSearch
• Both build on top of Lucene
• Search servers
• RESTful HTTP APIs
• Easy to administer
• Add powerful text/numerical search capabilities

BASIC SEARCH FEATURES
• Boolean logic (AND, OR, +, -)
• Sorting and Group By
• Range queries
• Phrase/Prefix/Fuzzy queries
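
As a rough illustration of these features, the query strings below use standard Lucene query syntax as accepted by Solr. The field names (`header`, `file_name`, `date_sent`) are borrowed from the Solr document slide later in this deck; the specific terms and dates are assumptions made up for the example.

```python
# Example query strings in Lucene query syntax (field values are illustrative).
EXAMPLE_QUERIES = {
    "boolean": "header:FAILURE AND file_name:configs.log",
    "required_and_excluded": "+header:disk -header:healthy",
    "range": "date_sent:[2013-05-01T00:00:00Z TO 2013-05-12T00:00:00Z]",
    "phrase": 'header:"disk dead"',
    "prefix": "header:fail*",
    "fuzzy": "header:failure~",
}

for name, query in EXAMPLE_QUERIES.items():
    print(f"{name:>22}: {query}")
```

Sorting is normally passed as a separate request parameter rather than inside the query string.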

ADVANCED SEARCH FEATURES
• Custom ranking/scoring
• More like this
• Auto suggest
• Faceting/Highlighting
• Geospatial search

SCALING SEARCH
• ElasticSearch and SolrCloud both have distributed features built in
• Auto-sharding
• Replication
• Query routing
• Transaction log

USE CASE
Problem
Sample Solution
Core Design Issues
Other Solutions

THE PROBLEM
[Diagram: NetApp filer devices at Client A, Client B, and Client C send logs to Home Base. Labels on the access side: REST API, Full SQL Access, Flat File Access; Latest, All; Applications, Engineers & Analysts.]

SEARCH APPLICATION FEATURES
• Find last three days of raw logs from an entire cluster
• Group available capacity by machine serial number and show the largest capacities first
• Search all device header lines for “FAILURE”
• View all hard disk objects that have product number 2341AB
• Find all motherboards with an associated customer ticket
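
Two of these requests could be expressed as Solr request parameters roughly as follows. This is only a sketch: the field names (`cluster_id`, `date_sent`, `header`) come from the Solr document slide, while the helper name and the assumption that `date_sent` is indexed as a Solr date field are illustrative.

```python
from urllib.parse import urlencode

def solr_params(query, sort=None, rows=10):
    """Build URL-encoded parameters for a Solr select request (hypothetical helper)."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort
    return urlencode(params)

# Last three days of logs from one cluster (assumes date_sent is a date field,
# so that Solr date math such as NOW-3DAYS applies).
last_three_days = solr_params(
    'cluster_id:"1333-2241-3411" AND date_sent:[NOW-3DAYS TO NOW]',
    sort="date_sent desc",
)

# All device header lines containing FAILURE.
failures = solr_params("header:FAILURE")

print(last_three_days)
print(failures)
```

The resulting strings would be appended to a Solr select URL; the host and collection are deployment-specific and omitted here.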

SAMPLE SOLUTION
[Diagram labels: Logs, Ingestion, Parsing/Loading, Indexing, Queries, Custom RESTful Search API, HDFS, MapReduce]

PARSING, LOADING, AND INDEXING
• Load HBase with parsed objects
• Store HBase ROW_ID
• Store pointer to raw file in HDFS
• Index a number of desired fields
[Diagram: sequence files such as /ingestion/sequencefile1, each marked with record offsets (for example 0, 1492, 2987, 4767, 5987)]
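
The steps above can be sketched as a function that builds one index document per parsed record. This is an assumption-laden illustration, not the presenter's code: the function name, the sample row key, and the sample values are hypothetical, and the field names follow the Solr document slide.

```python
def build_index_doc(rowkey, sequence_file, offset, parsed, indexed_fields):
    """Return a flat document for the search index.

    The document carries the HBase row key and a pointer (file + offset) to
    the raw record in HDFS, plus only the fields chosen for indexing.
    """
    doc = {"rowkey": rowkey, "sequenceFile": sequence_file, "sfOffset": offset}
    for field in indexed_fields:
        if field in parsed:
            doc[field] = parsed[field]
    return doc

# Hypothetical parsed record; values are illustrative only.
parsed = {
    "file_name": "configs.log",
    "date_sent": "2013/05/12",
    "header": "WARNING: DISK DEAD",
    "contents": "raw log body, not indexed in this example",
}
doc = build_index_doc(
    "row-0001", "/ingestion/sequencefile1", 1492, parsed,
    indexed_fields=["file_name", "date_sent", "header"],
)
print(doc)
```

Keeping only a pointer to the raw data in the index keeps the index small while still letting a search hit lead back to the full record.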

INSIDE OF HBASE
[Table: one row per rowkey, with columns object1, object2, object3, object4, object5, ...]

THE SOLR DOCUMENT
sfOffset: 2343
sequenceFile: /ingest/file2
cluster_id: 1333-2241-3411
Other fields: rowkey, system_id, date_sent, file_name, contents, header
Sample values shown for them: 42ADFF-BZMM, configs.log, 2013/05/12, WARNING: DISK DEAD, ...

SEARCH APPLICATION
[Diagram: an eight-step flow linking User, Search Application, Query, Data locations (Raw Data Location, Row_id), Raw Data, Stored Object, and Query Results.]
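
One plausible reading of this flow is: query the index, get back data locations, then fetch the stored object by row ID and the raw data by file and offset. The sketch below uses in-memory dictionaries as stand-ins for Solr, HBase, and HDFS; every name and value in it is an assumption made for illustration.

```python
# In-memory stand-ins (assumptions) for the search index, HBase, and HDFS.
INDEX = [
    {"rowkey": "row-1", "sequenceFile": "/ingest/file2", "sfOffset": 0,
     "header": "WARNING: DISK DEAD"},
    {"rowkey": "row-2", "sequenceFile": "/ingest/file2", "sfOffset": 19,
     "header": "OK"},
]
HBASE = {
    "row-1": {"type": "disk", "state": "dead"},
    "row-2": {"type": "disk", "state": "ok"},
}
HDFS = {"/ingest/file2": b"WARNING: DISK DEAD\nOK\n"}

def search(term):
    """Return index hits whose header contains the term (stand-in for a Solr query)."""
    return [d for d in INDEX if term in d["header"]]

def fetch_object(rowkey):
    """Look up the parsed object by row key (stand-in for an HBase get)."""
    return HBASE[rowkey]

def fetch_raw(path, offset):
    """Read one newline-terminated record starting at the given byte offset."""
    data = HDFS[path]
    end = data.index(b"\n", offset)
    return data[offset:end].decode()

def handle_query(term):
    """Resolve each index hit into its stored object and raw record."""
    return [
        {
            "object": fetch_object(hit["rowkey"]),
            "raw": fetch_raw(hit["sequenceFile"], hit["sfOffset"]),
        }
        for hit in search(term)
    ]

results = handle_query("DISK")
print(results)
```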

CORE DESIGN ISSUES
• Changing the Solr schema (manual reindex)
• Elastic shard scaling (manual reindex)
• No distributed joining (denormalizing the data)
• Replication*
• Manually managing Solr partitioning/sharding*
• Write durability*
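
To illustrate the denormalization point: without distributed joins, a request like "find all motherboards with an associated customer ticket" has to be answered from a single document, so the ticket information is copied onto each motherboard document at index time. The sketch below assumes hypothetical field names (`serial`, `ticket_id`, `ticket_ids`) and sample data.

```python
def denormalize(boards, tickets):
    """Copy ticket IDs onto each motherboard document, keyed by serial number."""
    tickets_by_serial = {}
    for ticket in tickets:
        tickets_by_serial.setdefault(ticket["serial"], []).append(ticket["ticket_id"])
    return [
        dict(board, ticket_ids=tickets_by_serial.get(board["serial"], []))
        for board in boards
    ]

# Hypothetical source records that would otherwise need a join.
boards = [{"serial": "MB-1"}, {"serial": "MB-2"}]
tickets = [{"ticket_id": "T-100", "serial": "MB-1"}]

docs = denormalize(boards, tickets)
with_ticket = [d["serial"] for d in docs if d["ticket_ids"]]
print(with_ticket)
```

The trade-off is that a new ticket means reindexing the affected motherboard document.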

SOLRCLOUD
• Automatic shard creation, routing
• Replication
• Limited to a fixed number of shards defined on initial creation
• ZooKeeper for coordination
• Large community
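
A generic way to see why a fixed shard count matters: when documents are placed by hashing their ID, changing the number of shards changes where most documents belong. The sketch below uses naive modulo routing with MD5 purely for illustration; it is not SolrCloud's actual router, and the document IDs are made up.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Map a document ID to a shard with a stable hash (naive modulo routing)."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

doc_ids = [f"doc-{i}" for i in range(1000)]

# Count documents whose shard assignment changes when a fifth shard is added.
moved = sum(1 for d in doc_ids if shard_for(d, 4) != shard_for(d, 5))
print(f"{moved} of {len(doc_ids)} documents change shards when going from 4 to 5")
```

Under this scheme most documents would have to be moved or reindexed after adding a shard, which is the kind of cost a fixed shard count avoids by ruling the change out.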

ELASTICSEARCH
• Similar feature set to Solr
• Purpose-built for easily managing a distributed index
• Rapidly growing community
• Custom-built coordination mechanism
• JSON based API

DATASTAX ENTERPRISE
• Integrates Cassandra and Solr
• Automatic indexing in Solr/storing in Cassandra
• Automatic partitioning
• Automatic reindexing
• Not limited to fixed number of shards
• Proprietary and costs money

CONCLUSION
• Collecting and analyzing device data/product logs can be very challenging
• You can use NoSQL and search technologies like Solr or ElasticSearch in unison...
• ...but it is not always easy to integrate search with NoSQL

QUESTIONS?
• Feel free to reach out if you have any questions or need help with big data/search!
• http://ryantabora.com
• http://thinkbiganalytics.com
• @ryantabora
• ryan.tabora@thinkbiganalytics.com

HBASE AND SOLR
• Automatic partitioning/reindexing
• Automatic index updates on HBase inserts/deletes
• Mapping HBase cells to a Solr schema
• No perfect commercial/open source solution yet
• Many many many more...

HBASE + SOLR: AUTOMATIC INDEXING
• HBase coprocessors are like stored procedures/triggers
• New, powerful, and dangerous
• Triggers on HBase puts/deletes
• Mapping data to a schema?
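
The trigger idea can be sketched conceptually as follows. Real HBase coprocessors are Java classes running inside region servers; this Python version is only an in-memory analogy, and the class names, hook names, and the cell-to-field mapping are all assumptions made for illustration.

```python
class Table:
    """Tiny in-memory table that calls observers after each put/delete."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, rowkey, cells):
        self.rows[rowkey] = cells
        for obs in self.observers:
            obs.post_put(rowkey, cells)

    def delete(self, rowkey):
        self.rows.pop(rowkey, None)
        for obs in self.observers:
            obs.post_delete(rowkey)


class IndexingObserver:
    """Keeps a search index in step with the table, like a trigger."""

    def __init__(self, cell_to_field):
        # Mapping from table cells to index fields (the "schema mapping").
        self.cell_to_field = cell_to_field
        self.index = {}

    def post_put(self, rowkey, cells):
        doc = {
            field: cells[cell]
            for cell, field in self.cell_to_field.items()
            if cell in cells
        }
        doc["rowkey"] = rowkey
        self.index[rowkey] = doc

    def post_delete(self, rowkey):
        self.index.pop(rowkey, None)


table = Table()
observer = IndexingObserver({"d:header": "header", "d:file": "file_name"})
table.observers.append(observer)

table.put("row-1", {"d:header": "WARNING: DISK DEAD", "d:file": "configs.log", "d:raw": "..."})
print(observer.index)   # row-1 is indexed with only the mapped cells
table.delete("row-1")
print(observer.index)   # the index entry is removed along with the row
```

The open question on the slide shows up here as the `cell_to_field` mapping: someone has to decide which cells become which index fields.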

HBASE + SOLR: WRITE DURABILITY
[Diagram components: MapReduce Indexing Application, HBase Table - SOLR_QUEUE, Solr Queue Reader, Solr Shard 1, Solr Shard 2, Solr Shard 3]
1. Create SolrDocument from raw ASUP
2. Add SolrDocument to HBase for durability
3. Get oldest SolrDocument from HBase Queue Table
4. Use custom hash algorithm to determine which shard to add SolrDocument to
5. Query Solr; if SolrDocument was added, then remove it from the SOLR_QUEUE
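
The five steps can be sketched with in-memory stand-ins, as below. This is an illustration under stated assumptions, not the presenter's implementation: an ordered dictionary plays the SOLR_QUEUE table, plain dictionaries play the Solr shards, and MD5 modulo stands in for the unspecified custom hash algorithm.

```python
import hashlib
from collections import OrderedDict

NUM_SHARDS = 3
solr_queue = OrderedDict()                     # stand-in for the SOLR_QUEUE HBase table
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for the Solr shards

def pick_shard(doc_id):
    """Choose a shard from the document ID (stand-in for the custom hash)."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_SHARDS

def enqueue(doc):
    """Steps 1-2: persist the document in the queue before indexing it."""
    solr_queue[doc["id"]] = doc

def drain_once():
    """Steps 3-5: index the oldest queued document, dequeue only once confirmed."""
    if not solr_queue:
        return False
    doc_id, doc = next(iter(solr_queue.items()))  # oldest entry first
    shard = shards[pick_shard(doc_id)]
    shard[doc_id] = doc                           # add to the chosen shard
    if doc_id in shard:                           # confirm the add succeeded
        del solr_queue[doc_id]
    return True

enqueue({"id": "asup-1", "header": "WARNING: DISK DEAD"})
enqueue({"id": "asup-2", "header": "OK"})
while drain_once():
    pass

print(sum(len(s) for s in shards), len(solr_queue))
```

Because a document leaves the queue only after it is confirmed in the index, a crash between steps 4 and 5 leaves it queued for a retry instead of losing it.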

HBASE + SOLR: ELASTIC SHARDING
• HBase's distribution mechanism uses the concept of regions to split data across many nodes
• Region splitting can be automatic or manual (performance degradation as regions split)
• Piggybacking Solr sharding on HBase Region splitting
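
One possible reading of "piggybacking" is to keep one Solr shard per HBase region, so a document's shard follows from the region whose key range contains its row key. The sketch below models only the key-range lookup and a split; the region boundaries and row keys are hypothetical, and nothing here calls HBase or Solr.

```python
import bisect

# Sorted region start keys (hypothetical); region i covers [start[i], start[i+1]).
region_starts = ["", "g", "p"]

def region_for(rowkey):
    """Return the index of the region whose key range contains rowkey."""
    return bisect.bisect_right(region_starts, rowkey) - 1

def split_region(split_key):
    """Add a new region boundary, as a region split would."""
    bisect.insort(region_starts, split_key)

before = region_for("kiwi")   # region starting at "g"
split_region("k")             # that region splits at "k"
after = region_for("kiwi")    # now served by the new region starting at "k"
print(before, after)
```

Under this reading, each region split would also require splitting or rebuilding the matching Solr shard.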
