BUILDING REAL-TIME ANALYTICS
With DataStax Enterprise (DSE)
jKoolCloud.com
1
Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
• Target audience: business users
• Insulate us from underlying software
• Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & Operational efficiency
• Ease of monitoring & management
2
Technologies we considered?
• SQL (Oracle, MySQL, etc.)
• Does not scale. At our parent company, Nastel, we have seen many customers run into this…
• RAM was “the” bottleneck. Commits take too long, and everything else stops while they run
• NoSQL
• Cassandra/Solr (DSE)
• Hadoop/MapReduce
• MongoDB
• Clustered Computing Platforms
• STORM
• MapReduce
• Spark (we learned about this while building jKool)
3
Why we chose Cassandra/Solr?
• Pros:
• Simple to set up & scale for clustered deployments
• Scalable, resilient, fault-tolerant (easy replication)
• Ability to have data automatically expire (TTL – necessary for our pricing model)
• Configurable replication strategy
• Great for heavy write workloads
• Write performance was better than Hadoop's
• Insert rate was of paramount importance for us: our goal was to get data in as fast as possible
• The Java driver balances the load among the nodes in a cluster for us (master-slave would never have worked for us)
• Solr provides a way to index all incoming data - essential
• DSE provides a nice integration between Cassandra and Solr
• Cons:
• Susceptible to GC pauses (memory management)
• The more memory, the more GC pauses
• Less memory per node and more nodes seems a better approach than one big “honking” server (we see 6-8 GB as optimal, so far)
• Data compaction tasks may hang
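To make the TTL bullet concrete, the sketch below mimics in plain Python how a per-write time-to-live behaves: each row carries an expiry time and stops being readable once that time passes. This is an in-memory illustration only, not Cassandra code. The class `TtlTable`, its methods, and the 30-day retention figure are assumptions made up for the example. In Cassandra itself the same effect comes from writing rows with a TTL (for example CQL's `USING TTL <seconds>`).

```python
import time


class TtlTable:
    """Toy key-value table whose rows expire after a per-write TTL (seconds)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock so expiry can be tested
        self._rows = {}              # key -> (value, expires_at)

    def insert(self, key, value, ttl_seconds):
        self._rows[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        row = self._rows.get(key)
        if row is None:
            return None
        value, expires_at = row
        if self._clock() >= expires_at:
            del self._rows[key]      # lazily drop the expired row on read
            return None
        return value


# Hypothetical retention tied to a pricing tier: keep events for 30 days.
RETENTION_SECONDS = 30 * 24 * 3600

now = [0.0]                          # fake clock we can advance by hand
table = TtlTable(clock=lambda: now[0])
table.insert("event-1", {"msg": "order placed"}, RETENTION_SECONDS)

fresh = table.get("event-1")         # still within retention
now[0] = RETENTION_SECONDS + 1       # jump past the TTL
expired = table.get("event-1")       # row is no longer readable
```

The point of the design is that retention is attached to the write itself, so no separate cleanup job has to find and delete old data.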
4
Why not Hadoop MapReduce?
• MapReduce is too slow for real-time workloads
• OK for batch, not so great for real-time
• Needs to be paired with other technologies for querying (Hive/Pig)
• Complex to set up, run and operate
• Our goals were simplicity first…
• Opted for STORM/Spark wrapped in our own micro-services platform, FatPipes, instead of MapReduce
5
Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
• Global write-lock performance concerns…
• Cassandra/Solr
• Java based (our project was in Java)
• Easy to scale and replicate data
• Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
• Did we say Java? Yes. (We like Java…)
• Flexible choice of platforms
• Great for time-series data streams (market focus for jKool)
• Inherent query limitations in Cassandra solved via Solr
integration (provided with DSE – as mentioned earlier)
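The consistency levels mentioned above trade latency for safety. As a rough illustration, the sketch below computes how many replicas must acknowledge an operation at a few levels and checks the usual overlap rule: a read is guaranteed to see the latest write when read replicas plus write replicas exceed the replication factor. The function names and the replication factor of 3 are assumptions for the example, and only ONE, QUORUM and ALL are modeled.

```python
def replicas_required(level, replication_factor):
    """Replica acknowledgements needed at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


RF = 3  # assumed replication factor for the example
quorum = replicas_required("QUORUM", RF)
fast_writes = read_sees_latest_write("ONE", "ONE", RF)
balanced = read_sees_latest_write("QUORUM", "QUORUM", RF)
write_heavy = read_sees_latest_write("ALL", "ONE", RF)
```

With a replication factor of 3, writing at ONE keeps inserts fast, but reads then need ALL to be sure of overlap, whereas QUORUM on both sides overlaps with two acknowledgements each.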
6
How we achieved near real-time analytics?
• Created our own micro-services architecture (FatPipes)
which runs on top of:
• STORM/JMS/Kafka
• FatPipes can be embedded or distributed
• Real-time Grid
• Feeds tracking data and real-time queries to CEP and back
• Users interact with the Real-time Grid via JKQL (jKool Query Language)
• English-like query language for analyzing data in motion and at rest
• “Subscribe” verb for real-time updates
7
Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
• Clustered way to process incoming real-time streams
• STORM handles clustering/distribution
• Kafka/JMS for messaging between grids
• Split streaming workload across the cluster
• Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
• For distributing queries and trend analysis
• Micro batching for historical analytics
• Loading large datasets into memory (across different nodes)
• Running queries against large datasets
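One common way to split a streaming workload across a cluster is to hash a key from each event and route the event to a worker by that hash, so events with the same key always land on the same worker. The sketch below shows the idea in plain Python. It is not STORM code (STORM provides its own stream groupings for this), and the names `pick_worker`, `WORKERS`, and the `source` field are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def pick_worker(key, worker_count):
    """Map a key to a worker index with a stable hash (same key, same worker)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


WORKERS = 4  # assumed cluster size
events = [
    {"source": "app-a", "value": 1},
    {"source": "app-b", "value": 2},
    {"source": "app-a", "value": 3},
    {"source": "app-c", "value": 4},
]

# Group events by the worker that would process them.
assignments = defaultdict(list)
for event in events:
    assignments[pick_worker(event["source"], WORKERS)].append(event)

for worker, batch in sorted(assignments.items()):
    print(worker, [e["value"] for e in batch])
```

Because each key is handled by a single worker, per-key state such as running counts stays local. Note that with a plain modulo, changing the worker count remaps most keys.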
8
Key to Real-time Analytics
• Process streams as they arrive while avoiding I/O
• Streams are split into a real-time queue and a persistence queue with eventual consistency (eventually… the real-time and historical views must reconcile)
• Both have to be processed in parallel
• Writing to the persistence layer first and analyzing afterwards will not achieve near real-time processing
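The split described above can be sketched with two in-process queues and two worker threads. This is a minimal illustration under stated assumptions, not the FatPipes implementation: the queue names, the `ingest` function, the running-sum aggregate, and the list standing in for the database are all hypothetical.

```python
import queue
import threading

realtime_q = queue.Queue()
persist_q = queue.Queue()
STOP = object()                             # sentinel that tells a worker to exit

running_total = {"count": 0, "sum": 0.0}    # in-memory real-time aggregate
stored = []                                 # stand-in for the persistence layer


def realtime_worker():
    """Update in-memory aggregates without touching storage."""
    while True:
        event = realtime_q.get()
        if event is STOP:
            break
        running_total["count"] += 1
        running_total["sum"] += event["value"]


def persistence_worker():
    """Write events to the (simulated) persistence layer."""
    while True:
        event = persist_q.get()
        if event is STOP:
            break
        stored.append(event)


def ingest(event):
    """Fan each incoming event out to both queues."""
    realtime_q.put(event)
    persist_q.put(event)


workers = [threading.Thread(target=realtime_worker),
           threading.Thread(target=persistence_worker)]
for w in workers:
    w.start()

for value in (10.0, 20.0, 30.0):
    ingest({"value": value})

realtime_q.put(STOP)
persist_q.put(STOP)
for w in workers:
    w.join()

# Once both queues have drained, the two views agree (the "reconcile" step).
reconciled = running_total["count"] == len(stored)
```

The real-time path never waits on a write, which is the point of the split. The price is that the two views can disagree briefly until the persistence queue catches up.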
9
High Level Architecture
10
Deeper View
[Architecture diagram; labels recovered from the slide:]
• jKool Web Grid: Web Application Server (×3)
• Storage Grid: Cassandra (×4)
• Search Grid: Solr (×4)
• Digest, Index
• Real-time Grid
• JKQL
• FatPipes Micro Services (INGEST)
• Compute Grid
• FatPipes Micro Services (REAL-TIME) (STORM/CEP)
• Distributed Messaging (JMS or Kafka)
11
Challenges we ran into?
• So many technology options (…so little time…)
• Deciding on the right combination is key early on
• Cassandra/Solr deployment (it was a learning experience for us)
• Lots of configuration, memory management, replication options
• Monitoring and managing clusters
• Cassandra/Solr, STORM, Zookeeper, Messaging
• + Leveraged our parent company’s AutoPilot technology
• Achieving near real-time analytics proved extremely
challenging – but we did it!
• Keeping track of latencies across the cluster
• Estimating computational capacity required to crunch incoming
streams
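Capacity estimation of this kind usually starts with simple arithmetic: peak event rate times per-event processing cost, divided by what one node can sustain, with headroom left over. The sketch below is a back-of-the-envelope calculation only. Every number in it and the function name `nodes_needed` are hypothetical, not measurements from jKool.

```python
import math


def nodes_needed(events_per_sec, cpu_ms_per_event, cores_per_node, headroom=0.5):
    """Estimate node count for a stream.

    headroom is the fraction of each node kept free for spikes, GC and compaction.
    """
    cpu_seconds_per_sec = events_per_sec * cpu_ms_per_event / 1000.0
    usable_cores = cores_per_node * (1.0 - headroom)
    return math.ceil(cpu_seconds_per_sec / usable_cores)


# Hypothetical workload: 50,000 events/s, 0.2 ms of CPU each, 8-core nodes.
estimate = nodes_needed(50_000, 0.2, 8)
```

Here the stream needs 10 CPU-seconds per second, each node offers 4 usable cores at 50% headroom, and the estimate rounds up to 3 nodes. Measured per-event cost should replace the guess as soon as it is available.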
12
Business Analyst User Interface
It's easy to “visualize your data”
13
jKOOL IN REAL-TIME
Real-time Demonstration of jKool’s usage of DSE

How We Used Cassandra/Solr to Build Real-Time Analytics Platform
