Real-Time Big Data at In-Memory Speed, Using Storm

Real Time Big Data With Storm,
Cassandra, and In-Memory Computing
Nati Shalom @natishalom
DeWayne Filppi @dfilppi

Introduction to Real Time Analytics
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

The Two Vs of Big Data
Velocity Volume

The Flavors of Big Data Analytics
Counting Correlating Research

It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss

Facebook & Twitter Real Time Analytics

FACEBOOK REAL-TIME ANALYTICS SYSTEM (LOGGING CENTRIC APPROACH)

The actual analytics..
• Like button analytics
• Comments box analytics

Facebook architecture..
[Diagram labels: Facebook logs (x3), Scribe, PTail, Puma, HBase, HDFS; paths marked "Real Time" and "Long Term Batch"; annotations "1.5 Sec" and "10,000 write/sec per server"]

TWITTER REAL-TIME ANALYTICS SYSTEM (EVENT DRIVEN APPROACH)

URL Mentions – Here’s One Use Case

Twitter Real Time Analytics based on Storm

Comparing the two approaches..
Facebook
• Relies on Hadoop for real time and batch
• RT = tens of seconds
• Suited to simple processing
• Low parallelization
Twitter
• Uses Hadoop for batch and Storm for real time
• RT = milliseconds to seconds
• Suited to complex processing
• Extremely parallel
This is what we’re here to discuss

Introduction to Storm

Storm Background
• Popular open source, real time, in-memory, streaming computation platform.
• Includes distributed runtime and intuitive API for defining distributed processing flows.
• Scalable and fault tolerant.
• Developed at BackType, and open sourced by Twitter

Storm Cluster

Storm Concepts
• Streams
– Unbounded sequence of tuples
• Spouts
– Source of streams (Queues)
• Bolts
– Functions, Filters, Joins, Aggregations
• Topologies

Challenge – Word Count
Tweets → Count → Word:Count
• Hottest topics
• URL mentions
• etc.

Streaming word count with Storm
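
A minimal sketch of the word-count dataflow, written in plain Python rather than against the Storm API. It only models the roles named earlier in the deck: a spout as the source of a stream, bolts as functions over tuples, and a topology as the wiring between them. All function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream. Here it just replays a fixed list."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: turn each sentence tuple into one tuple per word."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keep a running count and emit (word, count) after every update."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Topology: wire spout -> split -> count and collect the latest counts."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


if __name__ == "__main__":
    print(run_topology(["storm is fast", "storm is in memory"]))
```

In a real cluster each stage would run as many parallel tasks on different machines; the generators here run in a single process and only show the shape of the flow.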

Computing Reach with Event Streams
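
A small sketch of a reach computation, assuming "reach" means the number of unique people exposed to a URL: everyone who follows at least one account that tweeted it. The two lookup tables and the function name are hypothetical stand-ins for whatever stores a real topology would query.

```python
from typing import Dict, List, Set


def compute_reach(url: str,
                  tweeters_by_url: Dict[str, List[str]],
                  followers_by_user: Dict[str, List[str]]) -> int:
    """Count distinct followers of every account that tweeted the URL."""
    exposed: Set[str] = set()
    for tweeter in tweeters_by_url.get(url, []):
        # A set removes people who follow more than one tweeter.
        exposed.update(followers_by_user.get(tweeter, []))
    return len(exposed)


if __name__ == "__main__":
    tweeters = {"http://example.com": ["alice", "bob"]}
    followers = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}
    print(compute_reach("http://example.com", tweeters, followers))
```

The follower lookups are independent per tweeter, which is what makes this kind of query a natural fit for parallel stream processing.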

But where is my Big Data?

The Big Picture …
[Diagram labels: Data feeds (Kafka, Twitter, ..) with Twitter feeds (x3) and web activity (x3); Storm with a spout and two bolts; Cassandra, MongoDB, HBase, .. with analytics data, research data, counters and reference data; annotation "End to End Latency"]

• Storm performance and reliability
– Assumes success is normal
– Uses batching and pipelining for performance
• Storm plug-ins have a significant effect on performance and reliability
– Spouts must be able to replay tuples on demand in case of error
• Storm uses topology semantics to ensure consistency through event ordering
– Can be tedious for handling counters
– Doesn’t ensure the state of the counters
You’re only as strong as your weakest link
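
To illustrate why replay makes counters tedious, here is a hedged sketch in plain Python (not Storm code). It assumes batches carry an increasing transaction id and that a replayed batch arrives with the same id and the same contents. A naive counter double counts on replay; storing the last applied id next to each count lets the update be skipped. Class and method names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


class NaiveCounter:
    """Adds every batch it sees, so a replayed batch is counted twice."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def apply(self, txid: int, words: Iterable[str]) -> None:
        for word in words:
            self.counts[word] = self.counts.get(word, 0) + 1


class TransactionalCounter:
    """Keeps (last_txid, count) per word and ignores an already applied batch."""

    def __init__(self) -> None:
        self.state: Dict[str, Tuple[int, int]] = {}

    def apply(self, txid: int, words: Iterable[str]) -> None:
        # Aggregate the batch first so each key is touched once per batch.
        for word, delta in Counter(words).items():
            last_txid, count = self.state.get(word, (-1, 0))
            if last_txid == txid:
                continue  # replay of a batch this key has already absorbed
            self.state[word] = (txid, count + delta)

    def count(self, word: str) -> int:
        return self.state.get(word, (-1, 0))[1]


if __name__ == "__main__":
    naive, safe = NaiveCounter(), TransactionalCounter()
    for txid, batch in [(1, ["a", "b", "a"]), (1, ["a", "b", "a"]), (2, ["a"])]:
        naive.apply(txid, batch)
        safe.apply(txid, batch)
    print(naive.counts["a"], safe.count("a"))
```

The second approach only works if the state store can write the id and the count together, which is one reason the choice of store matters as much as the topology.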

Typical user experience…
Now, Kafka is *fast*. When running the Kafka Spout by itself, I easily reproduced Kafka's claim that you can consume "hundreds of thousands of messages per second".
When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
Source: A Big Data Trifecta: Storm, Kafka and Cassandra, Brian O'Neill's blog
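
The failure mode in the quote can be sketched as a toy simulation: a fast source feeding a slow sink with nothing limiting the tuples in flight. The rates below are illustrative numbers, not measurements, and the function is hypothetical. Capping the in-flight count is one common mitigation; Storm exposes a setting of this kind (topology.max.spout.pending) for spouts that track acknowledgements, though the sketch does not model Storm itself.

```python
from typing import Optional


def peak_in_flight(ticks: int, spout_rate: int, bolt_rate: int,
                   max_pending: Optional[int] = None) -> int:
    """Return the largest number of unprocessed tuples seen during the run."""
    in_flight = 0
    peak = 0
    for _ in range(ticks):
        emitted = spout_rate
        if max_pending is not None:
            # The spout may only emit while there is room under the cap.
            emitted = min(emitted, max(0, max_pending - in_flight))
        in_flight += emitted
        peak = max(peak, in_flight)
        # The bolt drains at most bolt_rate tuples per tick.
        in_flight -= min(in_flight, bolt_rate)
    return peak


if __name__ == "__main__":
    print(peak_in_flight(ticks=10, spout_rate=100, bolt_rate=10))
    print(peak_in_flight(ticks=10, spout_rate=100, bolt_rate=10, max_pending=50))
```

Without a cap the backlog grows every tick until memory runs out; with a cap the backlog stays bounded, at the cost of throttling the source to the speed of the slowest stage.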

An Alternative Approach
What if we could put everything in memory?

Did you know?
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
– Disk: 5-10 ms
– RAM: ~0.001 ms

In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code with data
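
A toy sketch of two of the data grid properties listed on this slide: data partitioned by key, and code sent to the partition that owns the data instead of pulling the data to the caller. It runs in one process and leaves out transactions and high availability. The class and method names are hypothetical and do not correspond to any product API.

```python
import zlib
from typing import Any, Callable, Dict, List


class PartitionedGrid:
    """In-process stand-in for a key-partitioned in-memory store."""

    def __init__(self, partitions: int) -> None:
        self._partitions: List[Dict[str, Any]] = [{} for _ in range(partitions)]

    def _partition_for(self, key: str) -> Dict[str, Any]:
        # crc32 is stable across runs, unlike Python's built-in str hash.
        index = zlib.crc32(key.encode("utf-8")) % len(self._partitions)
        return self._partitions[index]

    def put(self, key: str, value: Any) -> None:
        self._partition_for(key)[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self._partition_for(key).get(key, default)

    def execute(self, key: str, task: Callable[[Dict[str, Any], str], Any]) -> Any:
        """Run the task against the partition that owns the key."""
        return task(self._partition_for(key), key)


def increment(partition: Dict[str, Any], key: str) -> int:
    """Task that updates a counter in place, next to the data."""
    partition[key] = partition.get(key, 0) + 1
    return partition[key]


if __name__ == "__main__":
    grid = PartitionedGrid(partitions=4)
    grid.execute("storm", increment)
    grid.execute("storm", increment)
    print(grid.get("storm"))
```

Because every update for a given key lands on the same partition, a counter can be incremented without moving its value over the network.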

Integrating with Storm
[Diagram labels: web activity (x3); In Memory Data Stream (via Storm Spout plug-in); a spout and two bolts; In Memory Data Grid (via Storm Trident State plug-in) with analytics data, research data, counters and reference data]

In Memory Streaming Word Count with Storm
Storm has a simple builder interface for creating stream processing topologies
Storm delegates persistence to external providers
Integrating with Hadoop, NoSQL DB..
[Diagram labels: web activity (x3); In Memory Data Stream; Storm Plugin with a spout and two bolts; In Memory Data Grid with analytics data, research data, counters and reference data; Hadoop, NoSQL, RDBMS, …; annotations "Write Behind" and "LRU based Policy"]
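
A minimal sketch of how write-behind and an LRU policy can fit together, assuming a plain dictionary as the slower backing store. Writes are acknowledged from memory and queued; a later flush pushes them to the store in one step; the least recently used entries are dropped from memory when capacity is exceeded. The class is hypothetical and single threaded, and a real grid would flush asynchronously.

```python
from collections import OrderedDict
from typing import Any, Dict


class WriteBehindCache:
    def __init__(self, capacity: int, backing_store: Dict[str, Any]) -> None:
        self.capacity = capacity
        self.backing_store = backing_store
        self._cache: "OrderedDict[str, Any]" = OrderedDict()
        self._dirty: Dict[str, Any] = {}  # written in memory, not yet persisted

    def put(self, key: str, value: Any) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key)  # mark as most recently used
        self._dirty[key] = value
        self._evict_if_needed()

    def get(self, key: str) -> Any:
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        # Evicted entries may still be waiting in the write queue.
        value = self._dirty[key] if key in self._dirty else self.backing_store.get(key)
        if value is not None:
            self._cache[key] = value
            self._evict_if_needed()
        return value

    def flush(self) -> int:
        """Persist all queued writes and return how many were written."""
        flushed = len(self._dirty)
        self.backing_store.update(self._dirty)
        self._dirty.clear()
        return flushed

    def _evict_if_needed(self) -> None:
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # drop the least recently used entry


if __name__ == "__main__":
    store: Dict[str, Any] = {}
    cache = WriteBehindCache(capacity=2, backing_store=store)
    for word, count in [("a", 1), ("b", 2), ("c", 3)]:
        cache.put(word, count)
    print(store, cache.flush(), store)
```

Keeping evicted but unflushed values in the write queue means eviction never loses data; it only decides what stays hot in memory.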

Live Demo – Word Count At In Memory Speed

Recent Benchmarks..
Gresham Computing plc achieved over 50,000 equity trade transactions per second of load and match into a database.

References
• Try the Cloudify recipe
– Download Cloudify: http://www.cloudifysource.org/
– Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details:
– http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on GitHub:
– https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
– Part 3 coming soon.
